ABSTRACT

This work explores the feasibility of rewriting standard uniprocessor
programs into code which contains no context-dependent branches. That is,
this type of code (context independent code, CIC) would contain no branches
that might require different processing elements to branch different ways.

In order to investigate the possibilities and restrictions of CIC,
several programs were recoded into CIC and a four-element array processor
was built. This processor (the Super-65) consisted of three 6502
microprocessors and the Apple II microcomputer. The results obtained were
somewhat dependent upon the specific architecture of the Super-65 but,
within bounds, the throughput of the array processor was found to increase
linearly with the number of processing elements (PEs). The slope of
throughput versus PEs is highly dependent on the program and varied from
0.33 to 1.00 for the sample programs.

1. INTRODUCTION

The needs of most computer users are constantly changing. These
needs tend to demand faster and more powerful computers as the user puts
the computer to more extensive use. Faster computers may be obtained
either by improving the raw speed of the circuits and components or by
using the same circuits in a more efficient architecture. Unlimited
improvements in circuit speed cannot be expected due to fundamental
physical constants, the most notable of these being the speed of light.
Therefore, new approaches to computer organization must be found if
projected demands of computer users are to be met, particularly in the
area of large scientific problems.

In recent years, much attention has been given to unconventional
organizations and various supercomputers utilizing new concepts have been
built [Slotnick, 1967]. An endless number of questions and discussions is
possible when the capabilities and handicaps of different organizations are
compared. One can often find a specific application for which a given
architecture excels as well as instances in which the same approach is
ineffective. It is not the purpose of this work to make exhaustive
comparisons of the capabilities and handicaps of different architectures.
Only one particular organization will be dealt with: the array processor.

The array processor has been widely accepted by the computer community
as a cost-effective approach in a particular but rather important set of
applications [Thurber and Wald, 1975]. In this form of processor, high
throughput is achieved by introducing parallelism, that is to say, several
processors performing nearly identical operations. In this work, the array
architecture is examined and a new approach to the design of an array
processor is proposed in order to take advantage of the recent advent of
low-cost, high-performance microprocessors.

1.1 What is an Array Processor?

Illiac IV will be taken here as the conventional array processor.
This section is not meant to be a complete description of Illiac IV and
some familiarity with the work of Barnes et al. [1968] and of Kuck [1968]
is assumed. Only a few basic concepts are considered here in order to set
the stage for the discussion that follows.

Figure 1.1 shows the functional diagram of a single-processor
computer. It consists of: (1) a memory to hold operands and instructions,
(2) a control unit that fetches instructions from the memory, decodes them
and issues control signals to (3) an arithmetic unit that performs the
operations on operands taken from the memory. The most radical approach to
parallelism would obviously be to replicate the elements shown in Figure
1.1 a number (n) of times, providing adequate interconnections between the
elements. This is the multiprocessor approach [Flynn, 1972]. Although
powerful, this organization leads to several implementation problems and as
yet appears impractical for large n.

The expense of a multiprocessor architecture is primarily a result of 
the cost of the interface connecting each of the processors to each of the 
memories and the economic burden caused by the multiplicity of control 
units. This burden can be substantial as, in a sophisticated classical
machine, the control unit typically accounts for more than half of the
total gate count [Machado, 1972].

These considerations lead one to the conventional array computer
approach whose functional diagram is shown in Figure 1.2. Only the
arithmetic units and memories are replicated and one single control unit
(CU) drives the array of arithmetic units. Thus, an array processor is
characterized by the fact that a single instruction stream is executed
simultaneously by any or all of the arithmetic units. For certain
arithmetic units, the operation may have to be modified or suspended based
on the contents of a data or mask register in each processor. For this
reason, the entire control unit cannot be made central as certain control
decisions are operand-dependent. Therefore, a minimum amount of control is
kept local and each arithmetic unit plus its local control is called a
processing element (PE). The term processing unit (PU) is used to designate
a PE with its processing element memory (PEM). Instructions can be stored
either across the PEMs or in a special instruction memory.

Figure 1.1 Single-processor computer.

Figure 1.2 Conventional array computer.

One interpretation of the array-processor concept is that every PE
performs precisely the same instruction on the same addresses in its own
PEM. This constraint can be relaxed somewhat with the introduction of
extra hardware to allow local indexing, mode control and routing. These
concepts as they are commonly applied to an array processor (such as Illiac
IV) will now be briefly described.

Local Indexing. The CU broadcasts an address to each PE. This
address may be modified by each PE. In Illiac IV, for instance, an index
register and address adder are provided with each PE. A central index
register is also provided in the CU. The final operand address for PE_i
is the sum of the base address specified in the instruction, the contents
of the central index register in the CU, and the contents of the local
index register of PE_i.
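
The address computation just described can be illustrated with a short
sketch. The Python fragment below is only an illustration and is not
Illiac IV code; the function name effective_addresses and the sample
register values are assumptions introduced here.

```python
def effective_addresses(base, central_index, local_indexes):
    """Return the operand address formed by each PE.

    base          -- base address taken from the broadcast instruction
    central_index -- contents of the central index register in the CU
    local_indexes -- contents of each PE's local index register
    """
    return [base + central_index + local for local in local_indexes]


# Four PEs with different local index values (illustrative numbers only).
addresses = effective_addresses(base=0x100, central_index=0x10,
                                local_indexes=[0, 1, 2, 3])
print([hex(a) for a in addresses])
```

Every PE receives the same base and central index, so only the local
index register lets the PEs address different locations in their own PEMs.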

Mode Control. Although the goal of the array-processor structure is
to be able to control the processing of a number of data streams with a
single instruction stream, it is sometimes necessary to exclude some data
streams or to process them differently. This is accomplished by allowing
each instruction to be locally modified by the PEs. The simplest form of
mode control is to decide locally if control instruction I will be locally
executed as I or as a no-op, i.e. each PE can be turned on or off. This is
the only type of mode control available in Illiac IV. Complete
mode-control capability would obviously result in a multiprocessor approach.

Routing. Routing is defined as the method by which PE_i may obtain an
operand which is stored in PEM_j. This need arises in many applications
and therefore some way of routing operands from one PE to another is
necessary. The simplest type of routing is to link PE_i to PE_(i+1) and
PE_(i-1). This is called neighbor routing. Non-neighbor routing is then
obtained by a sequence of neighbor routings. Illiac IV uses an advanced
form of neighbor routing. This form is the four-nearest-neighbor routing.
Each PE is able to communicate with the four PEs adjacent in the four
directions (conventionally described as north, south, east and west).
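
A minimal sketch of four-nearest-neighbor routing follows. It assumes the
PEs are numbered row by row on a rectangular grid with wraparound at the
edges; the grid size, the wraparound and the function names are
assumptions made for illustration and are not taken from the Illiac IV
description above.

```python
def neighbors(pe, rows, cols):
    """Return the north, south, east and west neighbors of a PE.

    PEs are numbered row by row; edges wrap around (an assumption of
    this sketch).
    """
    r, c = divmod(pe, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,
        "south": ((r + 1) % rows) * cols + c,
        "east": r * cols + (c + 1) % cols,
        "west": r * cols + (c - 1) % cols,
    }


def route_hops(src, dst, rows, cols):
    """Count the neighbor routings needed to move an operand src -> dst."""
    r1, c1 = divmod(src, cols)
    r2, c2 = divmod(dst, cols)
    dr = abs(r1 - r2)
    dc = abs(c1 - c2)
    # With wraparound, the shorter way around each dimension is used.
    return min(dr, rows - dr) + min(dc, cols - dc)


print(neighbors(0, 4, 4))
print(route_hops(0, 10, 4, 4))
```

The hop count shows why non-neighbor routing costs a sequence of neighbor
routings: the farther apart two PEs are on the grid, the more steps the
operand needs.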

1.2 Motivation for Array Processors

There are many reasons for using single-instruction-stream-multiple-
data-stream (SIMD) architectures. Thurber and Wald [1975] contend that
SIMD architecture is useful for large problems such as weather analysis and
prediction, seismic data processing, phased-array-radar processing and
picture processing. They also contend that problems with inherent data
structure and parallelism such as solving systems of linear equations,
Fourier transforms and systems of partial differential equations can be
successfully executed on a SIMD machine. They further divide the
advantages into the following categories.

1. Hardware

a. Better use of hardware on highly parallel problems.
b. Cost effectiveness due to the advent of LSI microprocessors.


c. Overcoming the speed of [word illegible in source].

d. Reliability and graceful degradation of system.

2. Software

a. Simpler than for multiprocessor (MIMD).

b. Easier to construct large systems.

c. Less strict executive-function requirements.

These advantages are obtained at a price: such machines tend to be
special-purpose and any attempt to apply them to inappropriate problems
will likely be in vain.

1.3 Issues and Objectives of This Study

Assuming that the array processor is to be used in the proper
environment, there still remain several issues to be resolved, such as
processor-memory interconnection, interprocessor communication, software
requirements (major modifications of uniprocessor programs required, how
context-dependent branches are handled), expandability of the system and
fault tolerance of system.

1.3.1 Processor-memory interconnection. This study examines the
limitations imposed by allowing communication from each of the processor
elements to the shared memory (SM) only by way of a single address bus and
a single data bus. This structure is attractive both economically and from
the standpoint of system complexity. It is, in fact, the simplest
interconnection between several processors and a shared memory. As one can
imagine, this simplicity imposes certain limitations. This study will seek
to determine if the limitations imposed by this architecture are acceptable.

1.3.2 Interprocessor communications. As described earlier, most array
processors have some form of interprocessor communication. A typical
arrangement is to have a single bit channel of communication to each PE
immediately to the north, south, east and west. This research explores the
problems which result when the only form of communication between two PEs
is through the shared memory. This arrangement requires much less special
hardware to be added to the basic processing unit but instead requires a
WRITE to and a READ from the shared memory. On the other hand, the fact
that only one processor can write to the shared memory at a time may
adversely affect system throughput.

1.3.3 Software requirements. Because all processors share the same
address bus in an array processor, each processor must execute the same
instruction at the same time. If the programs to be executed were strictly
linear with no branching, this would not be a restriction; however, most
programs contain several branches and loops. Thus the uniprocessor program
is not directly executable on the array processor. In Illiac IV [Barnes et
al., 1968; Kuck, 1968], as in most array machines, the programs reside in
the shared memory and are specially compiled for the array by a host
processor. The instruction stream seen by the individual PE is essentially
uniprocessor code containing loops and branches. The Illiac IV allows local
control to determine if the branch that the array is taking should be
executed by that PE. If not, the PE executes no-ops until the array returns
to execute the other branch. For code containing short context-dependent
branches, the overall system throughput is not seriously degraded.
However, if nested branches and long context-dependent branches are allowed,
fewer and fewer PEs execute until all PEs are halted; the array then allows
the waiting PEs to execute the next portion of code. Flynn [1972] has
suggested that Minsky's Conjecture (that system throughput increases as
log2 n, where n equals the number of PEs) may be accurate if, on the
average, half of the remaining PEs continue to execute after a given
branch. As in the case of Illiac IV, the problems which the array
processor is designed to solve do not force it into large nested branches
often enough to produce such congestion.

One alternative, which removes the need to halt any PE, is to rewrite
the original code so that it contains no context-dependent branches. That
is, all context-dependent branches are recoded to allow the PE to execute
the same instructions, but the data on which the instructions operate
determine what operations are performed. This is not self-modifying code in
the sense that the program resides in SM and is never altered. The present
work investigates the feasibility of rewriting standard uniprocessor
programs into code which contains no context-dependent branches, hereafter
called Context Independent Code (CIC). This investigation also considers
the limitations such recoding may have upon general usefulness of the
system and on system throughput.

1.3.4 Expandability of system. Another aspect to be considered is
that of expandability of the system. Many array systems are completely
fixed as to the number of processors contained in the system. If a user
desires to expand this system, the only solution is to add another complete
system. An appealing aspect of the CIC software is that it allows the
system to be expanded one processor at a time without requiring the system
software to be completely rewritten. This is especially true if the number
of processors is contained as a variable within the program so that one
simply increments the variable when a PE is added to the system. Also one
is guaranteed that the system throughput increases linearly with n. This
obviously contradicts Minsky's Conjecture, which advocates of array
processing have been attempting to disprove for some time.


1.3.5 Fault tolerance of system. One final issue is that of fault
tolerance of the system. If one faulty processor causes the entire array
to fail, the array will have a much higher failure rate than any one of the
PEs. This means that if the system contains 1000 PEs, each with a failure
rate of approximately .01%, the system has an unacceptable failure rate of
10%. However, if one is able to decouple the PEs to the extent that no one
PE directly affects any other PE, then the failure rate is drastically
reduced. More important than fault tolerance is fault detection. That is,
one must be able to determine if the results the array is generating are
valid. The concept of SIMD processing combined with CIC has the special
feature that the address lines of all PEs must be the same at all times
(other than when local indexing occurs); any departure by one PE is a
certain indication of error.
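
The 10% figure quoted above can be checked with a short calculation. The
sketch below assumes that PE failures are statistically independent and
that any single PE failure brings down the whole array; both the
independence assumption and the function name are introduced here for
illustration.

```python
def array_failure_rate(n_pes, pe_failure_rate):
    """Probability that at least one of n independent PEs fails."""
    return 1.0 - (1.0 - pe_failure_rate) ** n_pes


exact = array_failure_rate(1000, 0.0001)   # 1000 PEs at .01% each
approx = 1000 * 0.0001                     # first-order estimate n * p
print(f"exact {exact:.4f}, first-order estimate {approx:.4f}")
```

Under these assumptions the exact value is slightly below the first-order
estimate of 10%, which is the figure used in the text.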

The major objective of this thesis is to demonstrate that CIC recoding
is feasible and attractive for some applications. In order to pursue this
objective, a four-processor array computer was built and utilized.
Consequences of the parallel architecture are distinguished from those of
the CIC recoding of the uniprocessor programs.




2. APPROACH TO A NEW ARRAY PROCESSOR

2.1 Effect of Context-Dependent Branches on System Throughput

Context-dependent branches reduce an array system's throughput
significantly. With no context-dependent branches (assuming little or no
memory contention), the throughput of N processors is N times that of the
single-processor system. To illustrate why context-dependent branches
reduce system throughput, consider an array of 32 processors. Allow this
array to execute a program containing a single context-dependent branch with
the length of the branch being 1/3 of the entire program (Figure 2.1). The
entire array executes the first 1/3 of the program, then the array divides
into two groups. One group desires to execute the left branch and the
other group needs to execute the right branch. Obviously, the array can
only execute one branch at a time and so one group of the array executes
its branch while the other group is either disabled or performs no-ops.
Then the groups reverse roles while the other branch is executed. Finally,
the entire array executes the last 1/3 of the program. The time required
for the 32 processors to execute this program is thus 4/3 the time required
for a single PE. Hence, the throughput of the system for this program is
3/4 the throughput of the array when no context-dependent branches are
encountered. However, the throughput achieved by this 32-PE array is 24
times the throughput of the single-processor system. One realizes from
this example that long context-dependent branches will reduce the array
performance much more than short context-dependent branches. One should
note that a program containing several short context-dependent branches is
usually preferred over a program containing fewer branches but with each
branch having a significant length. An example might be a program with 20
context-dependent branches with each branch constituting .5% of the entire
program (Figure 2.2). The time required for an array machine to execute
this program would be 1.1 times the time required for a single PE. Hence,
the array processor would be more than 90% efficient on this program and an
array of 32 PEs would be able to process 32(1/1.1) or 29 times the amount
of data in the same time as a single PE.
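
The two examples above follow the same arithmetic, which the sketch below
reproduces. It assumes two-way branches whose sides are of equal length and
that both sides are needed by at least one PE, so the array spends each
branch's time twice; the function names are hypothetical.

```python
def array_time_ratio(branch_fractions):
    """Array run time relative to a single PE.

    Each entry is the fraction of the uniprocessor run time spent in one
    two-way context-dependent branch. The array must execute both sides,
    so each branch costs its fraction a second time.
    """
    return 1.0 + sum(branch_fractions)


def array_speedup(n_pes, branch_fractions):
    """Throughput of the array relative to a single-processor system."""
    return n_pes / array_time_ratio(branch_fractions)


# One branch occupying 1/3 of the program (Figure 2.1).
print(array_speedup(32, [1 / 3]))
# Twenty branches of .5% each (Figure 2.2).
print(array_speedup(32, [0.005] * 20))
```

The first call gives a time ratio of 4/3 and the second a ratio of 1.1,
matching the speedups of 24 and roughly 29 quoted in the text.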

One final example is the program which contains nested context-
dependent branches. Consider a program which contains N-level nested
context-dependent branching. The array proceeds down one side of each
decision until it reaches the innermost decision. It executes first one
and then the other branch of the innermost decision until it has executed
every possible branch of the tree. By comparison, the uniprocessor
proceeds down the appropriate side of each decision, executing only those
branches that are necessary.

Let us define $S_{NA}$ to be the number of branches executed by an array
processor for a program containing N-level nested context-dependent
branching. One can see that for $N = 1$, $S_{NA} = 4$; for $N = 2$,
$S_{NA} = 10$; and for $N = 3$, $S_{NA} = 22$ (Figure 2.3). That is,
$S_{NA}$ is the total number of branches contained in a program with
N-level nested context-dependent branching. If one examines the flow
diagrams carefully, one notes that $S_{NA}$ exhibits the recursion

$$S_{NA} = 2 + 2\,S_{(N-1)A}$$

It is now asserted that:

$$S_{NA} = 2^{N+1} + 2^{N} - 2$$

Clearly, $S_{1A} = 2^{2} + 2^{1} - 2 = 4$, so we have a basis for
induction. Substituting $S_{(N-1)A} = 2^{N} + 2^{N-1} - 2$ into the
recursion formula for $S_{NA}$ yields the result

$$S_{NA} = 2 + 2\,(2^{N} + 2^{N-1} - 2)$$

or

$$S_{NA} = 2 + 2^{N+1} + 2^{N} - 4 = 2^{N+1} + 2^{N} - 2$$

Therefore, by induction, the assertion for $S_{NA}$ has been proven.

Let us now define $S_{NU}$ to be the branches executed by a uniprocessor
for a program containing N-level nested context-dependent branching. From
Figure 2.3, one notes that $S_{NU}$ follows the recursion

$$S_{NU} = 2 + S_{(N-1)U}$$

It is now asserted that

$$S_{NU} = 2N + 1$$

and $S_{(N-1)U} = 2(N-1) + 1$. For $N = 1$ the assertion gives
$S_{1U} = 3$, which serves as the basis for induction. Substituting
$S_{(N-1)U}$ into the recursion formula for $S_{NU}$, one obtains the
result

$$S_{NU} = 2 + [2(N-1) + 1]$$

$$S_{NU} = 2 + 2N - 2 + 1$$

$$S_{NU} = 2N + 1$$

Therefore, by induction, the assertion for $S_{NU}$ has been proven.

If every branch of the program is assumed to be of equal length, the
array processor will take $S_{NA}/S_{NU}$ times as long as the
uniprocessor; this is because the array must execute all $S_{NA}$ branches
of the program instead of just $S_{NU}$ branches as the single PE would.
It is obvious that nested context-dependent branches can drastically reduce
system throughput as N becomes large. The number of PEs required to allow
the array processor to obtain the same throughput as the single processor
for a program containing N levels of nested context-dependent branching is:

$$n = \frac{2^{N+1} + 2^{N} - 2}{2N + 1}$$

Of course, this relationship assumes all branches to be of equal
length and in a sense may be considered a worst case. However, one should
still note that even for a reasonable level of nesting, for example N = 5,
the required number of PEs is greater than 8 (Figure 2.4). Hence, one
should avoid nested context-dependent branches if at all possible.
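
The branch counts for nested branching can be checked numerically. The
sketch below evaluates the two recursions, compares them with the closed
forms for the array and the uniprocessor, and prints the break-even number
of PEs. The function names are hypothetical, and the uniprocessor base
case of 3 branches at N = 1 is taken from the closed form 2N + 1.

```python
def array_branches(n):
    """Branches executed by the array: S_NA = 2 + 2 * S_(N-1)A, S_1A = 4."""
    return 4 if n == 1 else 2 + 2 * array_branches(n - 1)


def uniprocessor_branches(n):
    """Branches executed by one PE: S_NU = 2 + S_(N-1)U, S_1U = 3."""
    return 3 if n == 1 else 2 + uniprocessor_branches(n - 1)


for level in range(1, 7):
    s_a = array_branches(level)
    s_u = uniprocessor_branches(level)
    # The recursions agree with the closed forms derived in the text.
    assert s_a == 2 ** (level + 1) + 2 ** level - 2
    assert s_u == 2 * level + 1
    print(level, s_a, s_u, round(s_a / s_u, 2))
```

For N = 5 the counts are 94 and 11, a ratio of about 8.5, consistent with
the statement that more than 8 PEs are needed to break even.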

2.2 Software Consideration: Independent and Dependent Data Handling

There are two principal methods of employing an array processor. The
first method is to assume each processing element (PE) has its own source
of data. That is, each processing unit is processing data which are
independent of any other PE's data. This, in a sense, is parallelism of
the highest degree and is usually the simplest to implement as there need
be little or no interprocessor communication. There are many applications
for such array configurations.

The second method of using the array processor is to employ each PE on
a subset of a larger problem. That is, the data given to each PE are
related in some manner to the data given to the other PEs. An example might
be that each PE is given a row of a large matrix and is given the job of
multiplying that row of elements with each column of another matrix. In
this way matrix multiplication may be performed rapidly.
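
The matrix example can be sketched as follows. Each call to
pe_row_times_matrix stands for the work given to one PE; the PEs are
simulated one after another here, whereas in an array they would run in
lock step. The function names and the sample matrices are assumptions made
for illustration.

```python
def pe_row_times_matrix(row, matrix_b):
    """Work of one PE: multiply its row by every column of matrix_b."""
    n_cols = len(matrix_b[0])
    return [sum(row[k] * matrix_b[k][j] for k in range(len(row)))
            for j in range(n_cols)]


def array_matmul(matrix_a, matrix_b):
    """Hand each row of matrix_a to a separate (simulated) PE."""
    return [pe_row_times_matrix(row, matrix_b) for row in matrix_a]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(array_matmul(a, b))
```

Every PE performs the same sequence of multiplies and adds on its own row,
so the loop bounds are identical across PEs and no context-dependent
branch is needed.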

Due to the complexity of array processors, array processor software
is typically very difficult to read. Many times, the software will contain
a substantial amount of special purpose instructions that are very
machine-dependent. In fact, designers of array machines have decided to
have the array processor execute only its own language. This language is
usually optimized to a high degree for the particular array architecture.
This obviously allows faster execution of programs written in and designed
for that language. However, it forces any program written in another
language to undergo a translation. These translations are seldom optimized
and hence such translated programs are usually considerably less efficient.

Figure 2.4 Graph of array throughput versus level of nested
context-dependent branching (horizontal axis: N, level of nested data
conditional branching).

Hence, if approaching a new array processor, one should carefully
consider the compatibility of the new design with conventional languages.
Also one should reduce the number of special instructions and other
eccentricities to a minimum. This will serve to make the software more
understandable and, if one were to decide to use a different
microprocessor, the conversion would be a much simpler task.

An array processor should ideally be designed for either separate
independent data paths to each PE or a collection of dependent data. The
principal difference between these two situations is that independent data
paths typically require much less sophisticated interprocessor
communications. Thus, if one knows that a great majority of the
applications of this array processor will have independent data paths, the
interprocessor communication channels may be simplified or eliminated.

2.3 Context Independent Code and Its Implementation

A new concept in the generation of array processors will be introduced
here. This is the concept of Context Independent Code (CIC). The
principle behind CIC is the elimination of context-dependent branches. One
should note that CIC eliminates all context-dependent branches, not all
conditional branches from the program. One cannot remove all conditional
branches, since they allow a programmer to execute different segments of
code upon different conditions being present. Context-dependent
branches are those conditional branches for which the data or conditions
may be different for different PEs. In particular, conditional branches
which are used to cause the program to loop a certain number of times are
not context-dependent, as all PEs will branch the same way every time.
That is, the condition which the branch is based upon will be the same for
all PEs. The conditional branches for which the condition will be
different in different PEs must be recoded so that the program appears not
to branch at all. For instance, the usual algorithm for multiplication
shifts the multiplicand and tests each bit of the multiplier. If the
multiplier bit is 1, the multiplicand is added to the partial product; if
the bit is 0, the program skips the add instruction. For the conventional
array processor, all PEs whose bit was 1 would execute the add instruction
while the rest of the PEs were turned off.

This program written in CIC would cause all of the PEs to execute the
add instruction, the difference being that those PEs whose bit was 1 would
add the multiplicand to the partial product while those PEs whose bit was 0
would add zero to the partial product.

This can be achieved in various ways, but of course one wishes to use
the most efficient means possible. The most efficient method appears to be
that of shifting each bit into the carry/borrow position and then
subtracting that bit from the accumulator, which has previously been set to
zero. This results in either FF or 00 depending on whether the bit was 1
or 0. If one then ANDs the multiplicand with the previous result, the
outcome is either the multiplicand or zero. Thus, if one always adds the
result of the iteration just described to the partial product, one will be
adding the multiplicand (if the bit was 1), or zero (if the bit was 0) to
the partial product.
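
The masking technique just described can be sketched in a high-level
language. The Python function below is an illustration of the idea and is
not 6502 code: the name cic_multiply, the 8-bit operand width and the
16-bit mask (used here in place of the byte values FF and 00) are
assumptions of this sketch.

```python
def cic_multiply(multiplicand, multiplier, bits=8):
    """Shift-and-add multiply with no data-dependent branch.

    Operands are unsigned and `bits` wide; the product is 2 * bits wide.
    """
    wide = (1 << (2 * bits)) - 1
    product = 0
    for _ in range(bits):              # same loop count on every PE
        bit = multiplier & 1           # bit that would enter the carry
        multiplier >>= 1
        mask = (0 - bit) & wide        # all ones if bit is 1, else zero
        addend = multiplicand & mask   # multiplicand or zero
        product = (product + addend) & wide   # the add always executes
        multiplicand = (multiplicand << 1) & wide
    return product


# Four simulated PEs share a multiplicand but hold different multipliers.
print([cic_multiply(3, m) for m in (0, 1, 2, 5)])
```

The only branch left is the loop test, which depends on the fixed bit
count rather than on the data, so every PE follows the same instruction
sequence regardless of its multiplier.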

One might argue that forcing a PE whose multiplier bit is 0 to add the
quantity 0 is no better than having it perform no-ops or turn itself off.
From the standpoint of performing worthwhile tasks, this is true. However,
this approach accomplishes two things which the previous approaches have
not done.

1. All the PEs are constantly synchronized in a lock step mode.
This does not require the programmer to sense which PEs are active
and which are not. The procedure for determining the status of
all the PEs can be somewhat time-consuming and may require
considerable hardware.

2. The software is tailored to suit the array processor rather than
tailoring an array processor to suit sequential software. This is
more of a philosophical question at this time. Software that is
designed to run on parallel array machines should prove to be more
efficient than conventional uniprocessor software. However, it
will probably take some time for programmers to adjust to array
software and such software may be initially resisted. The need
for array processors should overcome this initial resistance.

The implementation of Context Independent Code is almost totally free
of restrictions. The single restraint is that all PEs must execute the
same instructions at the same time. This means that the operands of the
instruction determine which branch the PE is actually executing. This
restriction eliminates the possibility of different PEs executing completely
different branches at the same time with all the PEs doing worthwhile tasks
all the time.

[...] of CIC is very straightforward in that no special techniques are
required. The programmer of the array is totally responsible for making
certain the programs are successfully written in Context Independent Code.
At this time, the architecture does not have the capability to detect
non-CIC programs and will attempt to execute any program that it is given.
A more sophisticated architecture might have a compiler that would flag
non-CIC programs, but this is beyond the scope of this research. Examples
of CIC programs will be included in a later chapter along with an
explanation of the method one might use in recoding different programs into
Context Independent Code.

2.4 Input/Output Concepts

Most array processors contain a host processor which controls
input/output. Typically the host processor receives the input data and
distributes it among the PEs. The host processor then requests the array to
act on the input data and, when the array has finished execution, the host
processor then gathers the output data. The output data is then sent to a
peripheral such as a tape, disc or video screen.

By using complete microprocessors as the PEs, one can allow the
input data to come directly to each PE via a private input port. Also, if
each PE generates sufficient output data, one may have each PE write to
its own output port. That is, one might allow all PEs to receive their
inputs simultaneously, perform the desired functions on the input and
output the individual results, all in parallel. This, of course, is the
best use of hardware and provides the highest possible throughput for the
array.

If the input data are such that the outputs generated will be quite
modest in number, a separate peripheral is not dedicated to each PE, but
instead the outputs are sent to shared memory where the controlling PE
outputs them to a single peripheral.

The case may be that one cannot afford a separate peripheral for each
PE, but one desires a greater throughput than an array with a single
peripheral can provide. In this case, one may consider some special
hardware that allows each PE to output to its own port. This special
hardware can gather the outputs from several of the PEs to be stored in a
given peripheral. As an example, assume that the system contains sixteen
PEs with only four disc systems. The hardware transfers the outputs from
four of the PEs to each of the disc systems. Various input/output
arrangements are possible and the system designer can select the one most
suitable for the type of application for which the array is intended.

2.5 General Hardware Considerations

In designing a new array processor, one must consider what
technological advances are available. Of course, this is true in the case
of classical computer design as well. However, an array processor has such
a multiplicity of components that the opportunity for:

1. improved overall speed
2. reduced component cost
3. reduced power consumption
4. reduced chip count

is much greater than for a single processor computer.

Until recently the design of an array processor was restricted to very
simple PEs [Machado, 1972] which typically had no local control except the
ability to decide whether or not to execute a given instruction. All other
controls resided in a single control unit (CU).

With the advent of inexpensive, single-chip microprocessors, one can
consider an array processor consisting of microprocessors. Each
microprocessor represents a single processing element (PE). Each
microprocessor has a small private memory or processing element memory
(PEM). The PEs all share a large memory called shared memory (SM). The
instruction stream comes from SM. As each of the microprocessors has all
of the control logic necessary to operate as a separate computer, it is
redundant to build a separate control unit (CU). Hence, one may designate
one of the microprocessors as the controlling processor (CP) and eliminate
the control unit. This approach also allows the possible implementation of
a certain degree of fault tolerance since any of the PEs can become the CP
if the original CP fails.

The decision to use microprocessors as the PEs restricts the word size
of each PE to the word size of the microprocessor. However, most
microprocessors allow for multiprecision arithmetic which allows one to
achieve the degree of precision required for a given application. Of
course, once one has decided to use microprocessors as PEs, a specific
microprocessor must be selected. A discussion of how one may select the
microprocessor is presented in a later section.

The next decision in the design of an array processor is the form of
address and data bus system to be employed. One has the choice of a single
bus system with both addresses and data multiplexed on the same bus or a
two-bus system with separate data and address buses. The latter seems to
be the most popular with microprocessor designers, basically because it
simplifies the overall system.

Hence, one should recognize that a two-bus system is simpler to use
and easier to build into an array system. Next, one must consider the PE
to shared memory (SM) connection. There are essentially two types of
connections:

1. Shared Address/Shared Data Bus

2. Multiple Address/Multiple Data Bus

The first type is the easiest to implement but imposes a possible
bottleneck when the array contains more and more PEs. One should note that
for the array processor, since every PE executes the same instruction, a
READ from SM can be executed simultaneously. It is only the WRITE to SM
that must be executed sequentially. This is because one can only write one
value to a specific address at a given time. Extra hardware might be needed
to place each PE's data word into a queue when a WRITE to SM is performed.
The large number of WRITEs to SM could then be executed in parallel with
the PEs executing instructions which require access only to private memory.
This would improve the effective transfer rate from the array to SM
considerably. One requirement for this arrangement is that the values
written to SM not be needed by any PE for a certain minimum time after the
WRITE to SM was performed.
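
The queued WRITE arrangement described above can be illustrated with a
small software model. This is only a sketch under stated assumptions, not
the hardware of this thesis: the function name drain_writes, the rule of
one SM write per memory cycle, and the choice of one destination address
per PE (base_addr plus the PE number) are all hypothetical.

```python
from collections import deque


def drain_writes(pe_words, base_addr):
    """Model a queue that serializes simultaneous WRITEs to SM.

    pe_words holds the data word latched from each PE when all PEs
    execute the same WRITE instruction. One queued word is written per
    memory cycle (an assumption) while the PEs would be free to continue
    with instructions that use only private memory.
    Returns (shared_memory, cycles_needed).
    """
    queue = deque(enumerate(pe_words))  # entries are (PE number, data word)
    shared_memory = {}
    cycles = 0
    while queue:
        pe, word = queue.popleft()
        # Assumed layout: each PE writes to its own SM address.
        shared_memory[base_addr + pe] = word
        cycles += 1
    return shared_memory, cycles


sm, settle_cycles = drain_writes([0x11, 0x22, 0x33, 0x44], base_addr=0x0400)
```

In this model settle_cycles equals the number of PEs, which corresponds to
the minimum time that must pass before any PE may rely on the values
written to SM.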

The second type of connection is considerably more difficult to
implement completely. This type of connection requires SM to have the
capability of communicating with several pairs of address and data buses.
Memories of this type are commonly called true multiport memories.
Multiport optical memories are now being researched [BarvtMtl et al.,
1978], but a true multiport memory as yet has not been placed on the
market. One can simulate a memory with several ports by using extremely
fast memory which is eight to ten times faster than standard memory chips.
However, this does not solve the problem when the number of PEs becomes
greater than eight or ten.

Because the technology for multiport memory is not available at this
time, one is forced to consider a shared bus system with perhaps some type
of queue to make the WRITE to SM appear to the individual PE to take about
as long as a WRITE to its PEM. Certainly one would not want the WRITE to
SM to take longer and longer as the number of PEs grows. However, the
present work does not address the queue problem. Thus, in order to allow
different PEs to write to SM, the architecture adopted requires each PE to
become the CP in order to write to SM. This architecture has the
unfortunate property that a WRITE to SM takes longer and longer as the
number of PEs grows. This would not be significant if the PEs wrote to SM
only to transfer final results at the end of every program.

One must consider what type of interprocessor communication is desired
for the array processor. As previously noted, most array processors have
nearest-neighbor connections. The most common form of communication is
that of a single word. However, in order to provide a reasonable limit to
this thesis, the design implemented allows no interprocessor communication
other than through SM.

One final consideration is what method to use in transferring control
from one PE to another. One can write the number of the PE desired to be
CP to a specific address in memory. Alternatively one can use an
unimplemented opcode of the microprocessor as a special instruction whose
operand designates which PE is to become the CP. Finally, one can address a
particular memory location (a so-called 'soft switch') in order to cause a
particular PE to become the CP. The memory address is decoded to determine
which PE is to be the CP. The first method uses a single memory location,
but requires substantial hardware to latch the data word and decode which
PE is desired. The second method uses no memory location but requires an
unimplemented opcode. This could lead to difficulties if the chosen opcode
were to be used in a later edition of the microprocessor. Also, most
microprocessor manufacturers will not guarantee what a microprocessor will
do when it attempts to execute an unimplemented opcode. The third method,
that of addressing a particular location in order to determine which PE is
to be the CP, is selected for use in the array processor design. The
details of the array processor architecture are fully described in a later
section.
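
The soft-switch method can be sketched in software as an address decoder.
The following is a minimal illustration only. The base address
SOFT_SWITCH_BASE and the function name select_cp are hypothetical and are
not taken from the Super-65 design; the sketch merely assumes one
soft-switch location per PE at consecutive addresses.

```python
SOFT_SWITCH_BASE = 0xC0F0  # hypothetical address, chosen for illustration
NUM_PES = 4                # the array described here has four PEs


def select_cp(address, current_cp):
    """Return the number of the PE that is CP after an access to address.

    An access to one of the soft-switch locations makes the matching PE
    the CP. Any other access leaves the current CP unchanged.
    """
    offset = address - SOFT_SWITCH_BASE
    if 0 <= offset < NUM_PES:
        return offset
    return current_cp


cp = select_cp(SOFT_SWITCH_BASE + 2, current_cp=0)  # PE 2 becomes the CP
```

Because only the address is decoded, no data word has to be latched, which
is the hardware saving claimed for this method over the first one.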

2.6 Selecting the Microprocessor for a Multi-Microprocessor System

Selecting the microprocessor for a multi-microprocessor system
involves many of the same considerations necessary when one wishes to
design a single-microprocessor system.

One of the main considerations is whether or not a given microprocessor
will be readily available, either for expanding the array or in case
of component failure. Sauin [1977] provides a relatively complete list of
available microprocessors.

Another important factor is compatibility, that is, whether or not
the microprocessor is designed to be easily interfaced both with peripheral
chips and with other microprocessors of the same type. With the variety
of microprocessors available today, an extensive comparison of all of the
possible choices would be quite lengthy. However, one can reduce the
selection considerably if one is interested only in general-purpose
microprocessors that are reasonably inexpensive. This removes
special-purpose microprocessors from consideration. Also, as yet, 16-bit
microprocessors are relatively expensive and are not used extensively
enough for them to be readily available. Thus, one should not attempt to
use 16-bit microprocessors in a multi-microprocessor environment until
they are more readily available and their unit price is reduced somewhat,
as will inevitably occur. One might consider a possible modification of
the architecture at a later time to allow use of 16-bit microprocessors
rather than the standard 8-bit microprocessors which are in widespread use
today.

Now one would like to reduce the overall chip count for the entire
array. Single-chip 8-bit microprocessors are now readily available and it
seems appropriate to select a single-chip processor if at all possible.

Another factor to be considered in narrowing the selection of a
microprocessor is that of versatility. Does the processor allow
straightforward implementation of multi-byte arithmetic? This capability
is extremely important in an array processor since many applications of
array processors require processing 16- and 32-bit data. Similarly, one
should consider the architecture of the proposed microprocessor. How many
registers are available to the programmer? How many different addressing
modes does the processor support? Does the processor have the capability
of implementing a stack? What address range is the microprocessor capable
of addressing? What type of interrupt capability does the microprocessor
have? Is the programmer able to halt the microprocessor? What are the
consequences of halting the microprocessor? Does the microprocessor allow
for direct memory access (DMA) by other devices?

One important area that should be analyzed carefully is the
instruction set of the microprocessor. Does the instruction set contain
all the essential instructions required to perform arithmetic, logical and
program control functions? How efficient is the PE with respect to the
number of machine cycles required per instruction? Does the microprocessor
require several machine cycles in order to execute even the simplest
instructions? If this is so, the microprocessor may actually be much slower
than another microprocessor that has a slower clock rate but requires fewer
machine cycles per instruction.

One should also consider what development aids are available for a
given microprocessor. For instance, are there entire systems based on a
given microprocessor such that one could use an assembler, editor, and
other diagnostic aids? Also, the ability to use a commercial system with
keyboard, video and monitor already in working order is invaluable.

In summary then, the optimum microprocessor for a multi-microprocessor
system would:

1. be a single-chip, 8-bit microprocessor

2. be readily available

3. be easily interfaced with various peripherals as well as other
microprocessors of the same family

4. allow multiple precision arithmetic

5. have as many registers as possible

6. have many modes of addressing

7. be capable of addressing as large an address range as possible

8. support sufficient interrupt levels

9. have the ability to implement a stack and thus allow subroutines
to be used effectively

10. normally require few machine cycles in order to execute a given
instruction

11. contain a relatively powerful instruction set that allows the PE
to be very versatile

12. operate at a clock rate that is competitive with other possible
choices

3. HARDWARE ASPECTS OF THE SUPER-65 MULTI-MICROPROCESSOR SYSTEM

3.1 Attributes of the 6502 Microprocessor

In the previous section, the best choices of the microprocessor were
reduced to readily available, single-chip, 8-bit, general purpose
microprocessors. It was also noted that all microprocessors considered
should operate at a competitive clock rate and be capable of addressing a
large address range. These restrictions reduce the selection of suitable
microprocessors to essentially three. They are:

(1) Zilog Z-80

(2) Motorola MC6800

(3) MOS Technology 6502

Each of these microprocessors is a general purpose, single chip, 8-bit
microprocessor. Each requires a single +5-volt supply and is TTL
compatible. These microprocessors may operate at various clock rates and
in order to compare them fairly, one must consider instruction execution
times at the same clock rate. A standard clock rate is one MHz.

The minimum instruction execution time is one type of comparison which
provides the system designer with a clearer understanding of the relative
performance of different microprocessors. With a clock rate of 1 MHz, the
Z-80 has a minimum instruction execution time of 4 microseconds [Garland,
1979]. The MC6800 and the 6502 both have a minimum instruction execution
time of 2 microseconds with a clock rate of 1 MHz [Artwick, 1980].
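
The relation between clock rate and machine cycles can be made concrete
with a short calculation. This is only a sketch: the function name
execution_time_us and the table MIN_CYCLES are introduced here for
illustration, and the cycle counts are simply the values implied by the
minimum execution times quoted above at 1 MHz.

```python
def execution_time_us(machine_cycles, clock_mhz):
    """Execution time in microseconds for a given cycle count and clock."""
    return machine_cycles / clock_mhz


# Minimum cycles per instruction implied by the 1 MHz figures in the text.
MIN_CYCLES = {"Z-80": 4, "MC6800": 2, "6502": 2}

times_at_1mhz = {name: execution_time_us(cycles, 1.0)
                 for name, cycles in MIN_CYCLES.items()}

# A Z-80 clocked twice as fast only matches a 6502 running at 1 MHz.
z80_at_2mhz = execution_time_us(MIN_CYCLES["Z-80"], 2.0)
```

The last line shows why a higher clock rate alone does not make a
processor faster: the number of machine cycles per instruction matters
equally.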

The Z-80 has 158 instruction opcodes, two 16-bit index registers, a
16-bit stack pointer and 14 general-purpose 8-bit registers [Borden, 1978].
The Z-80 has the following addressing modes:

(1) Implied Addressing              (6) Extended or Absolute Addressing
(2) Immediate Addressing            (7) Modified Page Zero Addressing
(3) Extended Immediate Addressing   (8) Relative Addressing
(4) Register Addressing             (9) Indexed Addressing
(5) Register Indirect Addressing    (10) Bit Addressing

However, the Modified Page Zero Addressing mode is used only for one
instruction, the Restart Page Zero instruction. Also, the Bit Addressing
mode is used solely to set, clear or test bits in a given word. Finally,
the Extended Immediate Addressing mode simply indicates that the immediate
operand is 16 bits rather than 8 bits. Thus, the Z-80 actually has only 7
different addressing modes. Another point to consider is that many of the
Z-80 instructions are due to the large number of registers available and do
not give one a greater variety of instructions as one might be tempted to
think.

The MC6800 has 72 instructions, one 16-bit index register, a 16-bit
stack pointer and two 8-bit accumulators. The MC6800 has the following
addressing modes:

(1) Implied Addressing     (4) Absolute Addressing
(2) Immediate Addressing   (5) Relative Addressing
(3) Zero Page Addressing   (6) Zero Page Indexed Addressing

A major deficiency of the 6800 is its lack of an indirect addressing mode.
One also notes that indexing can only be done in the zero-page mode.

At this point, a historical perspective helps in understanding the
evolution of these three microprocessors. Scanlon [1980] relates that in
1973, Intel Corporation introduced a second generation 8-bit microprocessor
called the 8080. The 8080 was designed with a calculator-like architecture
with eight scratch-pad registers, an internal stack register and special
input and output instructions.

Motorola Inc. saw the tremendous microprocessor market potential
evolving, and decided on an entry of their own. They had two choices.
They could challenge Intel Corporation on their ground by producing a new
and improved version of the 8080, as Zilog, Inc. did in 1976 with the Z-80.
The other choice was to design a more advanced microprocessor. Realizing
the difficulty of the first approach, Motorola decided to challenge Intel
Corporation with a superior product.

For the 6800 microprocessor, Motorola abandoned the calculator-like
register-oriented architecture of the 8080, and adopted a classic
minicomputer-like memory-oriented architecture. As a result, the 6800 has
fewer (and easier to understand) instructions, with more addressing options
than the 8080.

The preceding brief overview is necessary in order to set the stage
for introducing our chosen microprocessor, the 6502. The 6502 device was
designed by ex-employees of Motorola who saw that advances in processes,
coupled with a few architectural and software changes, could result in a
potentially highly marketable 6800-like microprocessor. They joined a
calculator-chip company called MOS Technology.

The MOS Technology design team had two objectives in mind for their
next generation microprocessor: low cost and high performance. They
reduced the complexity of the basic 6800 design as much as possible to
increase chip yield. Design changes included eliminating one of the two
accumulators in the 6800 and its tristate address output buffers. They
replaced the 16-bit index register of the 6800 with two separate 8-bit
index registers and discarded some of the lesser-used instructions of the
6800.

The elimination of these instructions opened up some instruction-decode
space and permitted the designers to provide the 6502 microprocessor
with 13 addressing modes, 7 more modes than the 6800. These modes give the
6502 capabilities that are normally found only in larger computers. This
addressing capability is enhanced by the extremely fast rate at which
the 6502 can execute instruction sequences. This speed is primarily due to
the fact that the 6502 is designed with a pipelining technique in which the
microprocessor fetches the next instruction before it is done processing
the current instruction. Additionally, the design team added a decimal
mode select instruction and control bit that allows the 6502 to operate on
either binary or decimal data. This means that the programmer does not
have to remember to write in a 'decimal adjust' instruction after every
addition or subtraction. Also, the newer depletion-load technology was
employed, which gives the 6502 cleaner switching characteristics and lower
power dissipation. The 6502 typically dissipates 250 mW versus 500 mW
typical for the 6800. This technology also results in better noise
immunity. The addressing modes for the 6502 are:

(1) Implied Addressing

(2) Immediate Addressing

(3) Zero Page Addressing

(4) Zero Page Indexed (X reg) Addressing

(5) Zero Page Indexed (Y reg) Addressing

(6) Absolute Addressing

(7) Absolute Indexed (X reg) Addressing

(8) Absolute Indexed (Y reg) Addressing

(9) Relative Addressing

(10) Indirect Indexed (Y reg) Addressing

(11) Indexed Indirect (X reg) Addressing

(12) Indirect Addressing

(13) Accumulator Addressing

The 6502 has 56 different types of instructions, with 151 different
instruction opcodes (i.e., 151 different instructions). One readily sees
that the 6502 has the most powerful addressing capability of the three
microprocessors. The 6502 instruction set is almost equal in size with
that of the Z-80. One key advantage of the 6502 over the Z-80 is that
for the same clock rate, the 6502 is at least twice as fast as the Z-80.

As microprocessors are built to run faster and faster, the determining
factor is not how fast an instruction is executed in absolute time, but how
many machine cycles are required. In this respect, the pipelining done by
the 6502 makes the 6502 instruction execution time with respect to machine
cycles very efficient.

The Zero Page Addressing capability allows the 6502 to use all 256
locations in page zero of memory as though they were registers. This allows
extreme versatility in programming the 6502. In fact, if the programmer
uses this feature of the 6502 properly, it is possible to realize up to
another factor of two increase in speed over the other microprocessors. The
page zero of memory can be utilized by the 6502 as more powerful computers
use cache memories. This feature alone makes the 6502 an excellent choice
for an array processor. When one considers the minimal power requirements
of the 6502, the superior addressing capability, and the higher throughput
due to pipelining, there is absolutely no better choice.

Although the decision of the microprocessor has already been made, one
should point out that the 6502 does indeed satisfy, and in some cases
exceeds, the requirements set forth in the preceding section. The 6502 is
the most efficient of the three microprocessors both in terms of speed and
power. It has the most powerful addressing capability of the three. It
has no strictly general purpose internal registers but more than makes up
for this by the ability to use the entire zero page of memory as registers
and by having not one but two index registers. It has stack capabilities
so that one may utilize both subroutines and interrupt routines. It has
the standard two level interrupt capability; it has both maskable and
non-maskable interrupts. It has on-board clock circuitry which reduces
the components necessary for a minimal microcomputer system [Camp et al.,
1979]. Its instruction set provides the programmer with all the essential
resources for programming the most diverse programs. The 6502 is extremely
easy to interface both with other 6502 microprocessors and with other 6500
series components such as input/output chips. The address bus of the 6502
always has a valid memory address. This allows for much easier
synchronization of several microprocessors. The 6502 has two control lines
called 'READY' and 'SYNC' which allow the possibility of single stepping
through a program. If the READY line is pulled low during phase one of a
SYNC high cycle, single-step operation of the 6502 can be achieved. The
6502 provides a vectored reset operation that allows one to program a
unique initialization routine to fit one's own needs. The 6502 has had
widespread use, not just as a microprocessor in user-designed systems but
in many commercial microcomputers such as the Apple, AIM-65, SYM, KIM, PET
and many others. Thus the 6502 is easily available and is compatible with
many commercial microcomputers.

Obviously, the final decision of which microprocessor to use is
somewhat subjective, but it is interesting to note that the latest 16-bit
microprocessor from Zilog, Inc., the Z-8000, has a memory-oriented
architecture (such as the 6502) which represents a solid break with the
8080/Z-80 architectural design concept of register-oriented
microprocessors. Also, this might indicate that a design with the 6502
microprocessor would be more easily converted to the newer 16-bit
microprocessors should one ever decide to modify the array processor.

3.2 The Apple II Microcomputer System

The Apple II microcomputer system is a versatile microcomputer that
employs the 6502 as its microprocessor. The Apple II has the capability of
the full 64K of memory, using dynamic RAM to reduce cost and power
consumption. The Apple II provides the 6502 with a 1.023 MHz clock signal
which is supplied to the phase zero (φ0) input of the 6502. The
microprocessor uses its address and data buses only when phase zero is
high. When phase zero is low, the microprocessor is doing internal
operations and does not need the data and address buses. The Apple II
designers allow the memory to be refreshed at a 3.5 MHz rate, when phase
zero (φ0) is low. In this way, memory refresh is entirely transparent to
the 6502. Espinosa [1979] explains the system timing in full and provides
a schematic of the Apple II that is invaluable to the system designer.

The 16-bit address bus lines are buffered by tristate buffers. The
address lines are held open only during a DMA cycle and are active at all
other times. A DMA cycle also halts the 6502. The address on the address
bus becomes valid about 300 nsec after phase one (complement of phase zero)
goes high and remains valid through all of phase zero (see Figure 3.1).

The 8-bit data bus lines are buffered by bi-directional tristate
buffers. Data from the microprocessor is put onto the data bus about
300 nsec after phase one (φ1) and READ/WRITE both drop to zero. At all
other times, the microprocessor is either listening to or ignoring the data
bus.

The RDY, RES, IRQ and NMI lines to the microprocessor are all held
high by 3.3k Ohm resistors to +5V (see Figure 3.2). These lines also
appear on the peripheral connectors (see Figures 3.3 and 3.4). The Set
Overflow line to the microprocessor is permanently tied to ground.

Figure 3.1 6502 timing signals. (See the 6502 hardware manuals for
details.)

All timing signals are derived from a 14.318 MHz master oscillator
output. The 7.159 MHz intermediate timing signal and the 1.023 MHz signals
phase zero and phase one are the only timing signals employed in the design
of the Super-65 array processor.
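
The relation between these frequencies can be checked with a short
calculation. This is a sketch only: the divisors 2 and 14 are assumptions
inferred from the frequencies quoted above, and the names MASTER_MHZ and
divided_clock are introduced here for illustration.

```python
MASTER_MHZ = 14.318  # master oscillator frequency quoted in the text


def divided_clock(master_mhz, divisor):
    """Frequency in MHz obtained by dividing the master oscillator."""
    return master_mhz / divisor


seven_m = divided_clock(MASTER_MHZ, 2)      # intermediate timing signal
phase_zero = divided_clock(MASTER_MHZ, 14)  # processor clock, phase zero

print(f"7M = {seven_m:.3f} MHz, phase zero = {phase_zero:.3f} MHz")
```

Under these assumed divisors the results agree with the 7.159 MHz and
1.023 MHz values given above to three decimal places.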

The Apple can support up to six 2K x 8 mask programmed Read-Only
Memory chips. One of the six ROMs is enabled whenever the microprocessor's
address bus holds an address between $D000 and $FFFF. Thus, the address
range $D000-$FFFF is reserved for ROM.

The Apple supports up to 48K x 8 Random Access Memory (RAM) or
READ/WRITE memory. As previously mentioned, this RAM is dynamic and is
refreshed automatically during every phase one cycle. The Apple supports
a sophisticated video system, but this need not be discussed in detail
here.

The Apple provides two female miniature phone jacks that allow one to
connect the Apple to a normal cassette tape recorder. In this way, one can
store user programs permanently on tape without incurring the expense of a
complete tape system.

The Apple provides users with eight peripheral connectors along the 
back of the Apple's main board. These slots are designed to allow the user 
more sophisticated resources such as disk drives, the ability to program in 
high level languages directly, etc. Also, the Apple designers give the 
user the option of plugging in proto-boards containing user-designed 
circuits. 

Pin    Name           Description

1      I/O SELECT     This line, normally high, will become low when
                      the microprocessor references page $Cn, where
                      n is the individual slot number. This signal
                      becomes active during φ0 and will drive 10
                      LSTTL loads. This signal is not present on
                      peripheral connector 0.

2-17   A0-A15         The buffered address bus. The address on
                      these lines becomes valid during φ1 and
                      remains valid through φ0. These lines will
                      each drive 5 LSTTL loads.

18     R/W            Buffered Read/Write signal. This becomes
                      valid at the same time the address bus does,
                      and goes high during a read cycle and low
                      during a write. This line can drive up to 2
                      LSTTL loads.

19     SYNC           On peripheral connector 7 only, this pin is
                      connected to the video timing generator's SYNC
                      signal.

20     I/O STROBE     This line goes low during φ0 when the address
                      bus contains an address between $C800 and
                      $CFFF. This line will drive 4 LSTTL loads.

21     RDY            The 6502's RDY input. Pulling this line low
                      during φ1 will halt the microprocessor, with
                      the address bus holding the address of the
                      current location being fetched.

22     DMA            Pulling this line low disables the 6502's
                      address bus and halts the microprocessor. This
                      line is held high by a 3KΩ resistor to +5v.

23     INT OUT        Daisy-chained interrupt output to lower
                      priority devices. This pin is usually
                      connected to pin 28 (INT IN).

24     DMA OUT        Daisy-chained DMA output to lower priority
                      devices. This pin is usually connected to pin
                      22 (DMA IN).

25     +5v            +5 volt power supply. 500mA current is
                      available for all peripheral cards.

26     GND            System electrical ground.

27     DMA IN         Daisy-chained DMA input from higher priority
                      devices. Usually connected to pin 24 (DMA
                      OUT).

28     INT IN         Daisy-chained interrupt input from higher
                      priority devices. Usually connected to pin 23
                      (INT OUT).

29     NMI            Non-Maskable Interrupt. When this line is
                      pulled low the Apple begins an interrupt cycle
                      and jumps to the interrupt handling routine at
                      location $3FB.

30     IRQ            Interrupt ReQuest. When this line is pulled
                      low the Apple begins an interrupt cycle only if
                      the 6502's I (Interrupt disable) flag is not
                      set. If so, the 6502 will jump to the interrupt
                      handling subroutine whose address is stored in
                      locations $3FE and $3FF.

31     RES            When this line is pulled low the microprocessor
                      begins a RESET (see page 36).

32     INH            When this line is pulled low, all ROMs on the
                      Apple board are disabled. This line is held
                      high by a 3KΩ resistor to +5v.

33     -12v           -12 volt power supply. Maximum current is
                      200mA for all peripheral boards.

34     -5v            -5 volt power supply. Maximum current is
                      200mA for all peripheral boards.

35     COLOR REF      On peripheral connector 7 only, this pin is
                      connected to the 3.5MHz COLOR REFerence signal
                      of the video generator.

36     7M             7MHz clock. This line will drive 2 LSTTL
                      loads.

37     Q3             2MHz asymmetrical clock. This line will drive
                      2 LSTTL loads.

38     φ1             Microprocessor's phase one clock. This line
                      will drive 2 LSTTL loads.

39     USER 1         This line, when pulled low, disables all
                      internal I/O address decoding.

40     φ0             Microprocessor's phase zero clock. This line
                      will drive 2 LSTTL loads.

41     DEVICE SELECT  This line becomes active (low) on each
                      peripheral connector when the address bus is
                      holding an address between $C0n0 and $C0nF,
                      where n is the slot number plus $8. This line
                      will drive 10 LSTTL loads.

42-49  D0-D7          Buffered bidirectional data bus. The data on
                      this line becomes valid 300nS into φ0 on a
                      write cycle, and should be stable no less than
                      100nS before the end of φ0 on a read cycle.
                      Each data line can drive one LSTTL load.

50     +12v           +12 volt power supply. This can supply up to
                      250mA total for all peripheral cards.

Figure 3.4 Peripheral connector descriptions

The Apple designers intend slot zero as a special purpose slot so that
many of the options available to the other seven slots are unavailable to
slot zero. Each slot is given sixteen locations beginning at $C080 for
general input and output purposes. For slot zero, these sixteen locations
are $C080-$C08F, for slot one they are $C090-$C09F, etc. Whenever the
address on the address bus is in a given slot's allocated range, pin 41
(called Device Select) goes low. This alerts the particular card that the
address is somewhere in that peripheral card's 16-byte allocation.

Each peripheral slot also has reserved for it one 256-byte page of
memory. This page is usually used to house 256 bytes of ROM, which
contains driving programs or subroutines for the peripheral card. The page
of memory reserved for each peripheral slot has the page number $Cn, where
n is the slot number. The signal on pin 1 (called I/O Select) of each
peripheral slot becomes active (drops to ground) when the microprocessor is
addressing an address within that slot's reserved page.

The 2K memory range from location $C800 to $CFFF is reserved for a 2K
ROM or PROM on a peripheral card, to hold large programs, etc. The
expansion ROM space also has the advantage of being absolutely located in
the Apple's memory map, which gives one more freedom in writing interface
programs. This PROM space is available to all peripheral slots and more
than one card can have an expansion ROM. However, only one expansion ROM
can be active at one time. The expansion ROM typically requires 2 enable
inputs. A suggested method is to use pin 1 (I/O Select) as one enable and
pin 20 (called I/O Strobe) as the other. The I/O Strobe line goes low
(active) when the address bus contains an address within the expansion ROM
space (i.e., between location $C800 and location $CFFF).
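
The address decoding described in the preceding paragraphs can be
summarized in a short sketch. It models only the three select lines
discussed in the text. The function name decode_slot_lines and the tuple
it returns are hypothetical conveniences, not part of the Apple II
documentation.

```python
def decode_slot_lines(address):
    """Return (line_name, slot) for the select line an address activates.

    slot is None for a line that is common to all connectors, and
    (None, None) is returned when no peripheral select line is active.
    """
    if 0xC080 <= address <= 0xC0FF:
        # Sixteen locations per slot; slot 0 starts at $C080.
        return ("DEVICE SELECT", ((address >> 4) & 0xF) - 8)
    if 0xC100 <= address <= 0xC7FF:
        # One 256-byte page $Cn per slot; not present on connector 0.
        return ("I/O SELECT", (address >> 8) & 0xF)
    if 0xC800 <= address <= 0xCFFF:
        # Shared 2K expansion ROM space.
        return ("I/O STROBE", None)
    return (None, None)


line, slot = decode_slot_lines(0xC090)  # first location of slot one
```

An expansion ROM that follows the suggested method would enable itself
when its own I/O Select has been seen and I/O Strobe is active.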

Thus, each peripheral card has available the buffered address bus,
buffered data bus, buffered READ/WRITE line, the READY line, the
Non-Maskable Interrupt line, the Interrupt Request (IRQ) line, and the
Reset line. Other leads available include: 1) the DMA line which disables
the 6502's address bus and halts the microprocessor; 2) the I/O Select line
which goes low (active) on the peripheral card when the address bus
contains an address within page $Cn, where n is the particular slot number;
3) the Device Select line which goes active (low) on a peripheral connector
when the address bus is holding an address between $C0n0 and $C0nF, where
n is the particular slot number plus eight; and 4) the I/O Strobe line
which becomes active (low) on all connectors when the address on the
address bus is between $C800 and $CFFF. Of course, all of the peripheral
connectors have phase zero (φ0), phase one (φ1), and the 7 MHz clock
signals available for synchronization of the cards with the 6502.

The Apple System Monitor acts as a supervisor of the system. From the
monitor one may look at one, some, or all memory locations; one can write
programs in Machine and Assembly languages to be executed directly by the
Apple's microprocessor; one can save data and programs onto cassette tape
or a floppy disk and read them back in again; one can move and compare
several bytes of memory with a single command; and one can leave the
monitor and enter any other program or language available on the Apple.

There is a program within the monitor which allows one to type
programs into the Apple in assembly level language. This program is called
the Apple Mini-Assembler. It is a 'mini'-assembler because it cannot
understand symbolic labels, something that a full assembler must do. For
details in using the Apple Mini-Assembler one should refer to the Apple II
Reference Manual (pp. 49-51).

The Apple II Monitor provides facilities for stepping through programs
both in single step and trace mode. Also, one is able to examine the
contents of the 6502's internal registers after each instruction is
executed. This allows one to properly debug difficult programs in a very
straightforward manner.

One can see from the above description that the Apple II gives the
user a very powerful microcomputer with all the necessary facilities to
write, assemble, debug and execute programs varying from machine-level to
high-level languages. It also provides more than adequate means for
extensive use of peripherals. Thus, the Apple II is an excellent choice for
implementing an array processor as it is not only versatile but economical
as well.

3.3 Architecture of the Overall System

The previous two sections provided the necessary information for the
two major building blocks of the Super-65 System: the 6502 microprocessor
and the Apple II microcomputer. One can now proceed to the description of
the Super-65 System.

The Super-65 Multi-Microprocessor System consists of four 6502
microprocessors, one being the Apple II 6502, with three others. Each 6502
is given a private random access memory of dimension 1K by 8 bits. This
memory resides in the lowest portion of the 6502's memory map. The rest of
the 6502's memory (the upper 63K) resides in the Apple II. This will be
both RAM and the Apple II Monitor ROM. The combined RAM and ROM are
jointly designated as Shared Memory (SM). One can see from the block
diagram (Figure 3.5) that the architecture of the Super-65 is relatively
simple. Each processor is able to access its private memory at any time
but only one processor is able to access the Shared Memory Address Bus at
a time. All processors may access the SM Data Bus on a READ. However,
because simultaneous writers would contend for the bus, only one processor
may access the SM Data Bus during a WRITE operation. One notes that the
architecture allows for both shared input/output and private input/output.
This allows the system to be used in many different environments: single
source and single destination, single source and multiple destination,
multiple source and single destination, or multiple source and multiple
destination.

Figure 3.5 System block diagram.

There is no provision for interprocessor communications other than
through Shared Memory. Each processor with its private memory, private
input/output and its portion of the Hardware Arbitrator is mounted on a
single prototype board which can be plugged directly into one of the seven
available peripheral slots on the back of the Apple II. The Hardware
Arbitrator consists of tristateable buffers on the Address, Data and
Control buses of each of the processors with the required logic and
flip-flops to enable the appropriate buffers.

One of the processors (typically the Apple II 6502) is designated as
the Controlling Processor (CP). The CP has control over all shared
resources. The CP has sole access to both the Shared Address and Control
Buses. The CP allows all of the processors access to the Shared Data Bus
during a Read from SM. At all other times, the CP has sole access to the
Shared Data Bus.

One interesting feature of this architecture is that it requires
almost no modification of the Apple II. The reason is that the Apple II
6502 is physically removed from its socket and placed onto a peripheral
board. However, a 40-pin conductor connecting the removed 6502 to the
empty socket allows the Apple to operate as if the 6502 were actually
still in the original socket. The Apple II 6502 then is equivalent to any
of the other processors. This allows for a very modular design. That is,
every processor board is identical to any other processor board. This
simplifies the debugging process a great deal.
One desirable characteristic mentioned previously was that of easy
fault detection. A simple type of fault detection was implemented on the
Super-65. This consists of comparing the address on the Shared Address Bus
with the current address on each of the processor address buses. If they
are not identical, a red LED is lit on the erroneous processor board.
Nothing further is done. It is assumed that the operator will observe the
problem soon after it occurs and take the proper steps to prevent the
faulty processor from contaminating the entire system.

As mentioned earlier, the Apple II uses dynamic RAM and as this memory
is divided into 16K by 1 bit chips, it is impossible to treat the lowest 1K
of RAM on the Apple differently than the next lowest 15K of RAM. Hence,
the Apple II 6502 was placed on a protoboard just like the other
processors. In this way all of the processors appear to be peripherals to
the Apple II. Because the processors appear as peripherals, whenever one of
them becomes the CP it initiates the equivalent of a DMA. If one examines
the schematic diagram of the Apple II (Figure 3.6), one notes that the line
DMA disables the buffers attached to the Apple II 6502's address bus. DMA
also disables the phase zero clock input to the Apple II 6502. This fact
forces one to make one trivial modification to the Apple II in order to
allow the Apple II 6502 to operate when it is not the CP. This
modification is to disconnect the phase zero input from the AND gate with
which DMA is able to disable the phase zero input and connect the phase
zero (φ0) to the phase zero (φ0) input on the peripheral connector.

The Apple II 6502 is in control when the Apple II is powered up,
just as it would normally be. In order to designate another of the
processors as the CP, one simply addresses the memory range $Cnxx, where n
equals the processor number (1, 2, 3, or 7). Seven is the number given to
the Apple II 6502. When one accesses $Cnxx (x = don't care), the Apple II
automatically activates the line I/O Select on the nth peripheral slot.
This signal is then used to give the processor on the board in slot n
control of the shared buses.

For many applications, it is desirable to load the same location in
each processor's private memory with different values. For example, one
may use that location as an indirect pointer, in which case each processor
has a different pointer. There are at least three different ways of
achieving different values for different processors. One way is to have a
hard-wired location in each processor's private memory. This location must
have different hard-wired values for each of the processors. One can then
manipulate the different values to obtain the desired differences between
processors. Another method is to have each of the private memories mapped
into the Apple's memory so that the Apple can load different values
directly into each of the private memories.

The third method has all of the processors store the value for
processor #1 into the location, then disable processor #1, store the value
for processor #2 into the same location (writing over the old value), then
disable processor #2, and so on until all of the processors have the
appropriate pointers. One can then restart all of the processors and
proceed with execution of the program. After substantial consideration, it
was decided that this method provides sufficient versatility while not
requiring as many additional components and board space. Realising that
what was required was one signal that would selectively designate a
particular processor for disabling and another signal for restarting all
the processors simultaneously, and not wanting to burden the already
saturated board with more decoding components, it was decided to use
already decoded signals provided by the Apple II. One recalls that each of
the peripheral slots contains three unusual signals: I/O Select, Device
Select and I/O Strobe. I/O Select is used to designate which processor is
the CP. As described earlier, Device Select goes low (active) when the
address on the Shared Address Bus is in the range $C0N0-$C0NF, where N
equals n + 8 (n = processor number). This signal then is well-suited for
disabling a particular processor. I/O Strobe is contained on all of the
peripheral slot connectors and goes low (active) on all of the slots
simultaneously, if the address on the Shared Address Bus is in the range
$C800-$CFFF. Thus, I/O Strobe may be used to restart all the processors
simultaneously after the initialization routine.
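The store-then-disable sequence described above can be modelled in a few lines. This is a minimal sketch under stated assumptions, not the actual initialization routine: the function name `load_private_values` is hypothetical, each processor's private location is modelled as one dictionary entry, and the Device Select and I/O Strobe accesses are reduced to removing a processor from, or restoring, a set of running processors.

```python
def load_private_values(values):
    """Model the third method: every running processor executes the same
    store, then one processor is disabled, and so on.
    `values` maps processor number -> byte intended for that processor.
    Returns the final contents of the private location on each processor."""
    private = {pe: None for pe in values}   # the same location on every board
    running = set(values)                   # all processors start enabled

    for pe, value in values.items():
        # Identical code on all running processors: store `value` privately.
        for active in running:
            private[active] = value
        # Device Select access for `pe`: its clock is gated off, so the
        # value it just stored is never overwritten by later stores.
        running.discard(pe)

    # I/O Strobe access: every processor is restarted together.
    running = set(values)
    assert running == set(values)
    return private


if __name__ == "__main__":
    print(load_private_values({1: 0x10, 2: 0x20, 3: 0x30, 7: 0x70}))
```

The point of the model is that no processor ever executes different code from the others; only the moment at which each one stops differs.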

The design of the Super-65 allows one to expand up to a total of seven
processors without any additional hardware being added to the basic system.
Furthermore, each processor board is a replica of the previous boards. In
this way, one can realistically modify any Apple II to provide it with the
capability of a seven processor array without extensive hardware alteration
of the Apple II itself. The architecture itself does not limit the number
of processors to seven. The Apple II only provides seven available slots
with the necessary decoding for the special signals used in this design.
If one desired to implement a larger array, one simply needs to provide the
extra decoding circuitry and peripheral slots external to the Apple II.

3.4 Design of the Individual Processor Card

One recalls from the previous section (Figure 3.5) that the Super-65
architecture provides for Shared Memory (SM), a Hardware Arbitrator, and
four processors each with its own private memory and input/output
capability. As mentioned earlier, the Shared Memory resides on the Apple II
main board. The Hardware Arbitrator is distributed across each of the
processor cards. Each processor card contains:

(1) A 6502 microprocessor
(2) A 1K x 8 bit RAM (i.e., 2-2114 memory chips)
(3) Two input/output ports (i.e., 1-6522 VIA)
(4) A tristate buffer for the 6502's address lines (i.e., 2-74LS245)
(5) A tristate buffer for the 6502's data lines
(6) A tristate buffer for certain control signals
(7) Decoding circuitry (i.e., 2-74LS155)
(8) Required logic for implementation of special features (i.e.,
    74LS00, 74LS04, 74LS05, 74LS30, 74LS74, 74LS76, 74LS85 and
    74LS121) (see Figure 3.7).

The power-up procedure resets the 6502s and causes each to jump to a
reset vector location. This location is in SM. The enable to the data bus
buffer will be active for only two cases: one, during a READ from Shared
Memory (this allows all processors to READ simultaneously) and two, during
a WRITE operation when the particular processor is the controlling
processor (CP). At all other times, the data bus of the 6502 is
disconnected from the Shared Data Bus. The enable to the address bus
buffers and the READ/WRITE line and DMA line buffers is active only when
the specific processor has been designated as the CP. This, then, allows a
given processor to control the Shared Address Bus, R/W and DMA lines only
when that processor is the CP.

As described previously, I/O Select goes low (active) on peripheral
connector n when the address on the Shared Address Bus is in the range
$Cn00-$CnFF (n = processor number). This signal is used to set a flip-flop
on the 74LS76 designated as the CP flip-flop. The 'NAND' of the I/O Select
signals from all the other peripheral connectors is used to reset the CP
flip-flop. Thus, when one of the other processors is designated as CP, the
signal which sets its CP flip-flop will simultaneously reset all other CP
flip-flops. Since the I/O Select signal remains active for only one-half
of a machine cycle, it was necessary to use the 7 MHz signal available on
the peripheral connector as the clock input to the CP flip-flop.

Figure 3.7 Processor card schematic diagram

The other flip-flop on the 74LS76 is designated as the Disable
flip-flop and is used to disable the phase zero (φ0) clock input to the
6502. As described in the preceding section, Device Select is used to
disable the desired processor and I/O Strobe is used to restart all the
processors simultaneously. Thus, Device Select is used to set the Disable
flip-flop and I/O Strobe is used to reset the Disable flip-flop. As the
Disable flip-flop output is simply used as a control input to an AND gate
with the phase zero clock as the other input, resetting the Disable
flip-flop would immediately allow the phase zero clock to be applied to
the 6502 clock input. Since there is no assurance what the 6502 will do
when its clock is removed, the most reliable method of restarting all the
processors is to perform a hardware reset, that is, to pull the Reset line
of all the processors low (active) simultaneously. Since the 6502 requires
that the Reset line remain low for several machine cycles in order to
execute a valid reset, it was necessary to have the I/O Strobe line fire a
monostable multivibrator (i.e., 74121) that was designed to hold the Reset
line low for at least several machine cycles. In order to have sufficient
time for all the processors to become synchronized, the monostable was
configured to allow a delay of approximately one second. This delay is
more than sufficient and could likely be reduced to one millisecond
without causing any complications.

Two 74LS155s decode the address lines of the 6502 to determine which
1K of memory the address is making reference to. If the address is in the
lowest 1K of memory, the private memory is enabled; otherwise the inverse
of the private memory enable is used to designate that a Shared Memory
access is requested.
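The private/shared split can be stated as a one-line test on the address. The sketch below only illustrates the decision made by the decoders; the names `PRIVATE_SIZE` and `memory_target` are hypothetical and do not come from the schematic.

```python
PRIVATE_SIZE = 0x0400  # lowest 1K of the 6502 memory map is private


def memory_target(address):
    """Return "private" for the lowest 1K, "shared" for everything above it.
    The shared request is simply the inverse of the private enable."""
    private_enable = 0 <= address < PRIVATE_SIZE
    return "private" if private_enable else "shared"


if __name__ == "__main__":
    for addr in (0x0000, 0x03FF, 0x0400, 0xC300):
        print(f"${addr:04X} -> {memory_target(addr)}")
```

Because page zero and the stack page both fall below $0400, every processor keeps its own copy of them, which is what allows the example programs later in the text to hold per-processor data at identical addresses.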

Four 74LS85s (4-bit comparators) are used to compare the address lines
of the 6502 with the address lines of the Shared Address Bus. As noted in
the Apple II Reference Manual, the address on the address bus becomes valid
about 300 nanoseconds after phase one goes high and remains valid through
all of phase zero. Since phase one is the complement of phase zero, when
phase zero goes high, phase one goes low. Also, since the address is valid
when phase zero goes high and the 74LS74 contains two negative-edge
triggered flip-flops, the design compares the upper and lower bytes of the
address separately and clocks in the result of the comparators on the
falling edge of phase one. Thus, one is able to determine which, if any, of
the processors is out of step with the others and which byte or bytes of
the address differ from that of the CP.
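The byte-wise comparison can be illustrated as follows. This is a sketch of the comparison logic only, with hypothetical function names; it ignores the clocking of the result on the falling edge of phase one and models the two latched comparator outputs as a pair of booleans.

```python
def address_mismatch(cp_address, pe_address):
    """Compare a processor's address with the Shared Address Bus value.
    Returns (low_byte_differs, high_byte_differs), one flag per byte,
    mirroring the separate comparison of upper and lower bytes."""
    low_differs = (cp_address & 0xFF) != (pe_address & 0xFF)
    high_differs = ((cp_address >> 8) & 0xFF) != ((pe_address >> 8) & 0xFF)
    return low_differs, high_differs


def out_of_step(cp_address, pe_addresses):
    """List the processors whose address differs from the CP's in any byte.
    `pe_addresses` maps processor number -> current 16-bit address."""
    return [pe for pe, addr in pe_addresses.items()
            if any(address_mismatch(cp_address, addr))]


if __name__ == "__main__":
    print(address_mismatch(0x1234, 0x1235))                    # low byte only
    print(out_of_step(0x1234, {1: 0x1234, 2: 0x1334, 3: 0x1234}))
```

A processor reported by `out_of_step` corresponds to a board whose LED would be lit.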

Finally, provision has been made for the inclusion of a 6522 Versatile
Interface Adaptor (see Figure 3.8), which contains two 8-bit parallel
ports, on each of the processor boards. Implementation of the 6522 will
simply be a matter of deciding at what addresses one would like the I/O
ports and various other control words of the 6522 to reside and then
providing the necessary decoding circuitry for the 6522. While this is not
trivial, it is straightforward and, as the main thrust of this work is not
actually the hardware implementation, the configuration of the 6522 is
left for future research.
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Figure 3.8 Processor card layout. 













4. EXAMPLES OF INDEPENDENT DATA HANDLING

4.1 Introduction 

As described in Section 2.2, independent data handling allows each
processor its own source of data. By definition, the data sources are
independent of each other. Of course, there are many applications in which
the data are not completely independent. However, a less precise but more
practical way of differentiating between independent and dependent data
handling methods is to differentiate between applications for which the
array is processing a separate problem for each processor (independent data
handling) and applications where the array is processing different segments
of the same problem for each processor (dependent data handling).

This chapter deals with several examples of independent data programs
that are written in Context Independent Code. One will note that
independent data programs are usually simpler and more efficient than
dependent data programs that perform the same function. This is because
independent data programs take advantage of more inherent parallelism and
require less overhead of interprocessor communication than dependent data
programs.

4.2 8-Bit Magnitude of Twos-Complement Number

The following program calculates the magnitude of an 8-bit
Twos-Complement Number. This program assumes that each of the processors
has its own 8-bit number stored in a location in page zero of private
memory designated by the symbolic name NUMBER. The magnitude of the twos
complement number stored in location NUMBER is placed in location MAGN at
the end of the program.

MAGNITUDE OF TWOS COMPLEMENT NUMBER/CONTEXT INDEPENDENT CODE

LABEL   MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:  LDA-Z     NUMBER   3;      GET NUMBER
        AND       #80      3;      MASK OFF ALL BUT SIGN BIT
        CLC                2;
        ROL       ACC      2;      MOVE SIGN BIT INTO LSB POSITION
        ROL       ACC      2;
        STA-Z     TEMP     3;      STORE SIGN BIT (FOR END AROUND CARRY)
        LDA       #0       2;
        SEC                2;
        SBC       TEMP     3;      IF S=0, ACC=00; S=1, ACC=FF
        EOR       NUMBER   3;      IF S=0, ACC=NUMBER; S=1, ACC=COMPLEMENT
                                   OF NUMBER
        CLC                2;
        ADC       TEMP     3;      ADD END AROUND CARRY
        STA-Z     MAGN     3;

                           33;     TOTAL MACHINE CYCLES


To determine the magnitude of a twos complement number, one must
decide first if the number is positive or negative. Of course, if the
number is positive, one does nothing to it. If the number is negative, one
calculates the twos complement inverse of it by complementing it and then
adding one to it. The Context Independent Code program stores the sign bit
to be used as the end around carry, then subtracts the sign bit from zero
to get either 00 or FF. The program then uses the fact that a value
exclusive-ORed with all zeros is that value (i.e., value = positive) and a
value exclusive-ORed with all ones is the complement of that value (i.e.,
value = negative). Finally, the program adds the sign bit (S=0, if
positive; S=1, if negative) to obtain the magnitude of the original twos
complement number.
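The branch-free sequence just described can be checked with a short model. The following is a sketch in Python rather than 6502 code, written only to mirror the arithmetic of the listing; the function name `magnitude_cic` is hypothetical, and the 8-bit registers are imitated by masking with 0xFF.

```python
def magnitude_cic(number):
    """Branch-free magnitude of an 8-bit twos complement value.
    Every input follows exactly the same sequence of operations."""
    number &= 0xFF
    sign = (number & 0x80) >> 7      # sign bit moved into the LSB position
    mask = (0 - sign) & 0xFF         # zero minus sign bit: 0x00 or 0xFF
    flipped = number ^ mask          # unchanged if positive, complemented if negative
    return (flipped + sign) & 0xFF   # add the end around carry (0 or 1)


if __name__ == "__main__":
    for raw in (0x05, 0xFB, 0x00, 0x80):
        print(f"${raw:02X} -> ${magnitude_cic(raw):02X}")
```

Note that the input $80 (-128) returns $80, since +128 is not representable as a signed 8-bit value; read as an unsigned byte the result is still the correct magnitude.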

4.3 8 x 8-Bit Multiplication

The following program multiplies one 8-bit number by another 8-bit
number to obtain a 16-bit product. This program assumes that one value is
already residing in a location called MPCND and the other value is already
present in a location called MPLR, both of which are in page zero of
private memory. The product is returned in two locations, PROD-L and
PROD-H, both of which are in page zero.
8 X 8-BIT MULTIPLICATION/CONTEXT INDEPENDENT CODE

LABEL   MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:  LDA       #00      2;      LOAD IMMEDIATE ZERO
        STA-Z     PROD-L   3;      CLEAR PRODUCT LOW BYTE
        STA-Z     PROD-H   3;      CLEAR PRODUCT HIGH BYTE
        LDX       #08      2;      SET BIT COUNT = 8 BITS
LOOP:   ASL       PROD-L   5;      SHIFT LEFT PRODUCT LOW BYTE
        ROL       PROD-H   5;      ROTATE LEFT PRODUCT HIGH BYTE
        ASL       MPLR     5;      SHIFT LEFT MULTIPLIER
        LDA       #00      2;      SUBTRACT CARRY BIT FROM ZERO TO OBTAIN
        SBC       #00      2;      EITHER 00 (C=1) OR FF (C=0)
        EOR       #FF      2;      COMPLEMENT PREVIOUS RESULT
        AND       MPCND    3;      AND EITHER 00 (CARRY=0) OR FF (CARRY=1)
        STA-Z     TEMP     3;      WITH MULTIPLICAND
        CLC                2;      TEMP = EITHER 00 OR MULTIPLICAND
        ADC       PROD-L   3;      ADD EITHER ZERO OR MULTIPLICAND TO
        STA-Z     PROD-L   3;      SHIFTED PARTIAL PRODUCT LOW BYTE
        LDA-Z     PROD-H   3;
        ADC       #0       2;      ADD POSSIBLE CARRY TO PRODUCT HIGH BYTE
        STA-Z     PROD-H   3;
        DEX                2;      DECREMENT BIT COUNT
        BNE       LOOP     2;      DONE? IF NOT, LOOP
END:    RTS                6;

                           392;    TOTAL MACHINE CYCLES REQUIRED


This program uses the algorithm of shifting the partial product left
once, then adding the multiplicand to the partial product if the tested bit
of the multiplier is set. If the tested bit of the multiplier is zero, the
original algorithm would branch around the add instruction and loop back to
test the next bit of the multiplier. Because this program is written in
Context Independent Code, which does not allow conditional branches for
which the condition may be different for different processors, instead of
branching around the add instruction, this program adds zero when the
tested bit of the multiplier is zero.
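The "add zero instead of branching" idea can be expressed compactly. The sketch below is a Python model of the shift-and-add loop, not a transcription of the 6502 listing; the name `multiply_cic` is hypothetical, and the mask plays the role of the 00/FF value that is ANDed with the multiplicand.

```python
def multiply_cic(multiplicand, multiplier):
    """8 x 8-bit shift-and-add multiply with no data-dependent branch.
    The only loop control is the fixed bit count, identical on every processor."""
    multiplicand &= 0xFF
    multiplier &= 0xFF
    product = 0
    for _ in range(8):                           # fixed trip count
        product = (product << 1) & 0xFFFF        # shift partial product left
        carry = (multiplier >> 7) & 1            # next multiplier bit, MSB first
        multiplier = (multiplier << 1) & 0xFF
        mask = (-carry) & 0xFF                   # 0x00 if bit clear, 0xFF if set
        product = (product + (multiplicand & mask)) & 0xFFFF  # adds 0 or multiplicand
    return product


if __name__ == "__main__":
    print(multiply_cic(13, 11))
    print(hex(multiply_cic(0xFF, 0xFF)))
```

Every call performs the same eight additions regardless of the operand values, which is why the cycle count of the CIC listing is constant.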

4.4 16/8-Bit Binary Division

The following program is the reverse of the multiplication algorithm.
That is, this program takes a 16-bit dividend in two locations called
DVND-L and DVND-H and divides them by an 8-bit value stored in a location
called DVSR. All memory locations are assumed to be in page zero of each
processor's private memory. The 8-bit quotient is returned in a memory
location called QNT and the 8-bit remainder is returned in location RMDR.
16/8-BIT DIVISION/CONTEXT INDEPENDENT CODE

LABEL   MNEMONIC  OPERAND      CYCLES  COMMENT

BEGIN:  LDX       #08          2;      NUMBER OF BITS IN DIVISOR = 8
        LDA-Z     DVND-L       3;      GET LSB DIVIDEND
        STA-Z     QNT          3;      STORE LSB DIVIDEND IN QUOTIENT
        LDA-Z     DVND-H       3;      GET MSB DIVIDEND
        STA-Z     TEMP         3;
DIVID:  ASL-Z     QNT          5;      SHIFT DIVIDEND-QUOTIENT
        ROL-Z     TEMP         5;      LEFT ONE BIT
        CMP-Z     DVSR         3;      CAN DIVISOR BE SUBTRACTED?
        LDA       #00          2;
        ROL       ACC          2;      GET CARRY BIT INTO LSB OF ACCUMULATOR
        STA-Z     SFLAG        3;      STORE SUBTRACT FLAG BIT
        EOR       [illegible]  2;      COMPLEMENT FLAG BIT, (0=SUBTRACT ZERO,
                                       1=SUBTRACT DIVISOR)
        ADC-Z     QNT          3;      INCREMENT QUOTIENT IF DIVISOR COULD BE
                                       SUBTRACTED
        STA-Z     QNT          3;      STORE NEW QUOTIENT
        LDA-Z     SFLAG        3;      GET SUBTRACT FLAG
        ROR       ACC          2;      ROTATE SUBTRACT FLAG BIT TO BORROW
                                       POSITION
        SBC       #00          2;      ACCUMULATOR = (00 IF B=1, FF IF B=0)
        AND-Z     DVSR         3;      AND DIVISOR WITH EITHER FF OR 00
        STA-Z     SFLAG        3;      STORE EITHER 00 OR DIVISOR
        SEC                    2;
        LDA-Z     TEMP         3;
        SBC-Z     SFLAG        3;      SUBTRACT EITHER 00 OR DIVISOR FROM
                                       DIVIDEND
        DEX                    2;      LOOP UNTIL ALL 8 BITS ARE PROCESSED
        BNE       DIVID        2;
        STA-Z     RMDR         3;      STORE REMAINDER
END:    RTS                    6;

                               447;    TOTAL MACHINE CYCLES REQUIRED


This program uses the algorithm of shifting the dividend left once,
then executing a trial subtraction of the divisor. If the subtraction is
possible, the quotient is incremented and the actual subtraction executed.
As in the multiply program, this program does not use a conditional branch
to determine if the subtraction should be done. Instead, the subtract flag
determines whether one is subtracting zero or the divisor and whether zero
or one is added to the quotient.
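The role of the subtract flag can be illustrated with a short model. This sketch describes the intended shift, trial-subtract and conditional-add behaviour in Python; it is not a row-by-row transcription of the listing. The name `divide_cic` is hypothetical, and the model assumes the high byte of the dividend is smaller than the divisor, so that the quotient fits in 8 bits.

```python
def divide_cic(dividend, divisor):
    """16/8-bit division without a data-dependent branch.
    Assumes 0 < divisor <= 255 and (dividend >> 8) < divisor.
    Returns (quotient, remainder), both 8-bit values."""
    quotient = dividend & 0xFF            # low byte doubles as the quotient register
    remainder = (dividend >> 8) & 0xFF    # high byte is the running remainder
    for _ in range(8):                    # fixed trip count
        carry_out = quotient >> 7
        quotient = (quotient << 1) & 0xFF               # shift dividend-quotient left
        remainder = (remainder << 1) | carry_out        # may briefly need 9 bits
        flag = int(remainder >= divisor)                # subtract flag: 0 or 1
        mask = (-flag) & 0xFF                           # 0x00 or 0xFF
        quotient += flag                                # add zero or one
        remainder = (remainder - (divisor & mask)) & 0xFF  # subtract zero or divisor
    return quotient, remainder


if __name__ == "__main__":
    print(divide_cic(1000, 7))    # same result as divmod(1000, 7)
```

The comparison that produces `flag` is a value computation, not a branch: both the quotient update and the subtraction are executed on every pass.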

4.5 Accumulation
The following program accumulates the sum of 255 words, each 32 bits
long. The program assumes the data reside in the upper 1020 bytes of the
first 1K of memory. The data are stored in four sections with base
addresses BASE-0, BASE-1, BASE-2, and BASE-3. Thus, one is able to access
all 255 words by indexed addressing. The five byte result is returned in
the first five bytes of page zero, called symbolically ACM-0, ACM-1, ACM-2,
ACM-3, and ACM-4.

32-BIT ACCUMULATOR/CONTEXT INDEPENDENT CODE

LABEL   MNEMONIC  OPERAND   CYCLES  COMMENT

BEGIN:  LDA       #00       2;      CLEAR ACCUMULATOR SPACES
        STA-Z     ACM-0     3;
        STA-Z     ACM-1     3;
        STA-Z     ACM-2     3;
        STA-Z     ACM-3     3;
        STA-Z     ACM-4     3;
        LDX       #FF       2;      NUMBER OF WORDS = 255
LOOP:   DEX                 2;
        LDA       BASE-0,X  4;      GET LSB
        ADC       ACM-0     3;      ADD TO ACCUMULATOR ZERO
        STA-Z     ACM-0     3;
        LDA       BASE-1,X  4;      GET NEXT MOST SIGNIFICANT BYTE
        ADC       ACM-1     3;      ADD TO ACCUMULATOR ONE
        STA-Z     ACM-1     3;
        LDA       BASE-2,X  4;      GET NEXT MOST SIGNIFICANT BYTE
        ADC       ACM-2     3;      ADD TO ACCUMULATOR TWO
        STA-Z     ACM-2     3;
        LDA       BASE-3,X  4;      GET MSB
        ADC       ACM-3     3;      ADD TO ACCUMULATOR THREE
        LDA       #00       2;
        ADC       ACM-4     3;      ADD POSSIBLE CARRY TO ACCUMULATOR FOUR
        STA-Z     ACM-4     3;
        CPX       #00       2;
        BNE       LOOP      2;      ALL 255 NUMBERS ADDED?
        RTS                 6;

                            16,857; TOTAL MACHINE CYCLES REQUIRED

This program is very close to a standard 6502 32-bit accumulation
program. The only possible change would be to only add the carry to ACM-4
when it was set, that is, to branch around the add instruction when the
carry was zero. However, later this program with independent data will be
compared to the same application with dependent data. That is, instead of
performing a different 32-bit accumulation for each of four processors, one
performs a single 32-bit accumulation using all of the four processors.
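The accumulation with an unconditional carry add can be sketched as below. This is a Python model of the byte-wise carry chain, assuming words are supplied as integers rather than as four byte sections in memory; the function name `accumulate_cic` is hypothetical.

```python
def accumulate_cic(words):
    """Sum 32-bit words into a five-byte accumulator, least significant
    byte first. The carry into the fifth byte is always added (as 0 or 1)
    instead of being skipped with a branch."""
    acm = [0, 0, 0, 0, 0]                      # ACM-0 .. ACM-4
    for word in words:
        carry = 0
        for i in range(4):                     # four bytes of the 32-bit word
            total = acm[i] + ((word >> (8 * i)) & 0xFF) + carry
            acm[i] = total & 0xFF
            carry = total >> 8
        total = acm[4] + carry                 # unconditional: adds zero or one
        acm[4] = total & 0xFF
    return acm


if __name__ == "__main__":
    result = accumulate_cic([0xFFFFFFFF] * 255)
    print([f"{b:02X}" for b in result])
    print(int.from_bytes(bytes(result), "little") == 255 * 0xFFFFFFFF)
```

With at most 255 words of 32 bits each, the sum is below 2^40, so five bytes are always enough.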

4.6 32 x 32-Bit Binary Multiplication

The following program multiplies one 32-bit number by another 32-bit
number to obtain a 64-bit product. The program assumes that the
multiplicand already resides in four bytes called symbolically MPCD-0,
MPCD-1, MPCD-2 and MPCD-3. The program also assumes that the multiplier
already resides in four bytes called symbolically MPLR-0, MPLR-1, MPLR-2
and MPLR-3. The 64-bit product is returned in eight bytes called
symbolically PRD-0, PRD-1, PRD-2, PRD-3, PRD-4, PRD-5, PRD-6 and PRD-7.
All memory locations are assumed to be in page zero.

32 X 32-BIT MULTIPLICATION/CONTEXT INDEPENDENT CODE

LABEL   MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:  LDA       #00      2;      CLEAR PRODUCT BYTES 0-7
        STA-Z     PRD-0    3;
        STA-Z     PRD-1    3;
        STA-Z     PRD-2    3;
        STA-Z     PRD-3    3;
        STA-Z     PRD-4    3;
        STA-Z     PRD-5    3;
        STA-Z     PRD-6    3;
        STA-Z     PRD-7    3;
        LDX       #20      2;      32 BITS IN MULTIPLIER
SHIFT:  ASL       PRD-0    5;      SHIFT PRODUCT BYTES 0-7 LEFT ONE BIT
        ROL       PRD-1    5;
        ROL       PRD-2    5;
        ROL       PRD-3    5;
        ROL       PRD-4    5;
        ROL       PRD-5    5;
        ROL       PRD-6    5;
        ROL       PRD-7    5;
        ASL       MPLR-0   5;      SHIFT NEXT BIT OF MULTIPLIER INTO CARRY
        ROL       MPLR-1   5;      POSITION
        ROL       MPLR-2   5;
        ROL       MPLR-3   5;
        LDA       #00      2;
        SBC       #00      2;
        EOR       #FF      2;      IF C=0, ACCUM=00. IF C=1, ACCUM=FF
        STA-Z     MASK     3;      STORE MASK
        AND       MPCD-0   3;      IF C=0 MASK OFF MULTIPLICAND BYTE 0
        STA-Z     TMP-0    3;      STORE EITHER MPCD-0 OR 00
        LDA-Z     MASK     3;
        AND       MPCD-1   3;
        STA-Z     TMP-1    3;      STORE EITHER MPCD-1 OR 00
        LDA-Z     MASK     3;
        AND       MPCD-2   3;
        STA-Z     TMP-2    3;      STORE EITHER MPCD-2 OR 00
        LDA-Z     MASK     3;
        AND       MPCD-3   3;
        STA-Z     TMP-3    3;      STORE EITHER MPCD-3 OR 00
        CLC                2;
        LDA-Z     PRD-0    3;
        ADC       TMP-0    3;      ADD EITHER 00 OR MPCD-0 TO PRODUCT
        STA-Z     PRD-0    3;
        LDA-Z     PRD-1    3;
        ADC       TMP-1    3;      ADD EITHER 00 OR MPCD-1 TO PRODUCT
        STA-Z     PRD-1    3;
        LDA-Z     PRD-2    3;
        ADC       TMP-2    3;      ADD EITHER 00 OR MPCD-2 TO PRODUCT
        STA-Z     PRD-2    3;
        LDA-Z     PRD-3    3;
        ADC       TMP-3    3;      ADD EITHER 00 OR MPCD-3 TO PRODUCT
        STA-Z     PRD-3    3;
        LDA-Z     PRD-4    3;
        ADC       #00      2;      ADD POSSIBLE CARRY
        STA-Z     PRD-4    3;
        LDA-Z     PRD-5    3;
        ADC       #00      2;      ADD POSSIBLE CARRY
        STA-Z     PRD-5    3;
        LDA-Z     PRD-6    3;
        ADC       #00      2;      ADD POSSIBLE CARRY
        STA-Z     PRD-6    3;
        LDA-Z     PRD-7    3;
        ADC       #00      2;      ADD POSSIBLE CARRY
        STA-Z     PRD-7    3;
        DEX                2;      ALL 32 BITS PROCESSED?
        BNE       SHIFT    2;
        RTS                6;

                           3306;   TOTAL MACHINE CYCLES REQUIRED

This program simply expands the single byte multiplication program in
Context Independent Code to that of a four byte multiplication. This
requires shifting of a product which is eight bytes rather than two, and
requires much more temporary storage, but is a straightforward extension of
the simpler program.

4.7 Comparison of CIC Programs With Uniprocessor Programs

The following program calculates the magnitude of an 8-bit twos
complement number. This is a standard uniprocessor program and hence is not
in CIC.


8-BIT MAGNITUDE/UNIPROCESSOR CODE

LABEL   MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:  LDA-Z     NUMBER   3;      GET NUMBER
        BPL       DONE     2;      IF POSITIVE, DONE
        EOR       #FF      2;      INVERT NEGATIVE NUMBER
        CLC                2;
        ADC       #01      2;      ADD END AROUND CARRY
DONE:   STA-Z     MAGN     3;

                           14;     CYCLES IF NEGATIVE
                           8;      CYCLES IF POSITIVE


The uniprocessor program requires approximately 11 machine cycles on
the average to obtain the magnitude of an 8-bit twos complement number.
The CIC program, on the other hand, always requires 33 machine cycles to do
the same job. Thus, one must use 3 processors to obtain the same
throughput as the original processor for this task. If one were to
represent the throughput of the array as Kn times the single processor
throughput, where n = number of processors and K is the 'recoding factor',
the recoding factor for the 8-bit twos complement magnitude program is
0.33. This simply means that for this task, the number of processors used
should always be greater than three for effective use of CIC programming.
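The recoding factor and the break-even array size follow directly from the two cycle counts. The sketch below assumes, as the paragraph above does, that K is the ratio of uniprocessor cycles to CIC cycles; the function names are hypothetical, and the break-even count is the smallest n for which Kn is at least 1.

```python
def recoding_factor(uni_cycles, cic_cycles):
    """K = uniprocessor cycles / CIC cycles for the same task."""
    return uni_cycles / cic_cycles


def break_even_processors(uni_cycles, cic_cycles):
    """Smallest n with K * n >= 1, i.e. ceil(cic_cycles / uni_cycles).
    Integer arithmetic avoids floating-point rounding at exact ratios."""
    return -(-cic_cycles // uni_cycles)


def array_throughput(uni_cycles, cic_cycles, n):
    """Array throughput relative to one processor running uniprocessor code."""
    return recoding_factor(uni_cycles, cic_cycles) * n


if __name__ == "__main__":
    k = recoding_factor(11, 33)
    print(f"K = {k:.2f}")
    print("break-even processors:", break_even_processors(11, 33))
    print("relative throughput with 4 PEs:", round(array_throughput(11, 33, 4), 2))
```

For the magnitude program (11 versus 33 cycles) three processors only match a single uniprocessor, so a gain requires more than three.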

The following is the uniprocessor program from Levanthal [1979] that
calculates the 16-bit product of two 8-bit numbers.
8 x 8 MULTIPLICATION/UNIPROCESSOR

LABEL    MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:   LDA       #00      2;      LSB OF PRODUCT = ZERO
         STA-Z     PRD-H    3;      MSB OF PRODUCT = ZERO
         LDX       #08      2;      8 BITS IN MULTIPLIER
SHIFT:   ASL       ACC      2;      SHIFT PRODUCT LEFT ONE BIT
         ROL       PRD-H    5;
         ASL       MPLR     5;      SHIFT MULTIPLIER LEFT ONE BIT
         BCC       NO ADD   2;      NO ADDITION IF NEXT BIT IS ZERO
         CLC                2;
         ADC       MPCD     3;      ADD MULTIPLICAND TO PARTIAL PRODUCT
         BCC       NO ADD   2;
         INC       PRD-H    5;      ADD CARRY TO MSB OF PRODUCT
NO ADD:  DEX                2;
         BNE       SHIFT    2;      LOOP UNTIL 8 BITS ARE MULTIPLIED
         STA-Z     PRD-L    3;      STORE LSBS OF PRODUCT
         RTS                6;

                            208;    MACHINE CYCLES TYPICALLY REQUIRED


Since the CIC program requires 392 machine cycles, the recoding factor
for this program is 0.53 (208/392). Effective use of CIC programming
requires that one employ a number of processors that is greater than two.
One should note that the uniprocessor program requires only one
accumulator for efficient processing. For example, the uniprocessor does
not initially clear the LSB of the product, nor does it store the result of
the addition within the loop. Also, the LSB of the product is left in the
accumulator, which allows it to be shifted much more quickly than if it
were in page zero. All these facts allow the uniprocessor program to be
executed much quicker than the CIC program. If one were to have another
accumulator available (as in the 6800), the difference would be reduced to
replacing the BCC instruction with LDA #00, SBC #00, EOR #FF, AND MPCND and
STA-Z TEMP. The additional execution time would then be approximately 80
machine cycles, or about thirty-eight percent longer. As one can see, the
effectiveness of CIC programming depends as much on the expertise of the
programmer as any other single factor.
The following is a uniprocessor 16/8-bit division program.

16/8-BIT DIVISION/UNIPROCESSOR

LABEL    MNEMONIC  OPERAND  CYCLES  COMMENT

BEGIN:   LDX       #08      2;      DIVISION BITS = 8
         LDA-Z     DVND-L   3;      GET LSB DIVIDEND
         STA-Z     QNT      3;
         LDA-Z     DVND-H   3;      GET MSB DIVIDEND
DIVID:   ASL-Z     QNT      5;      SHIFT DIVIDEND-QUOTIENT LEFT ONE BIT
         ROL       ACC      2;
         CMP-Z     DVSR     3;      CAN DIVISOR BE SUBTRACTED?
         BCC       NO SUB   2;      NO, GO TO NEXT STEP
         SBC       DVSR     3;      YES, SUBTRACT DIVISOR AND INCREMENT
         INC       QNT      5;      QUOTIENT
NO SUB:  DEX                2;      LOOP UNTIL ALL 8 BITS ARE
         BNE       DIVID    2;      DIVIDED
         STA-Z     RMDR     3;      STORE REMAINDER
END:     RTS                6;

                                    TYPICAL MACHINE CYCLES REQUIRED


The CIC program requires 447 machine cycles and thus the recoding
factor for this program is 0.45. In this case, three processors are
required to obtain a throughput greater than the throughput of a single
processor executing uniprocessor code.

The uniprocessor 32-bit accumulation program is identical to the CIC
program except that instead of adding a possible carry to the fifth byte,
one inserts a BCC instruction which causes one not to execute the add if
the carry is not set. This will reduce the execution time by 2.5 machine
cycles on the average. For all practical considerations the CIC program
executes as fast as the uniprocessor program and hence K ≈ 1.0.

The following is a uniprocessor 32 x 32-bit multiplication program.

32 x 32-BIT MULTIPLICATION/UNIPROCESSOR

LABEL     MNEMONIC  OPERAND         CYCLES  COMMENT
BEGIN:    LDA       #00             2
          STA-Z     PRD-0           3       CLEAR PRODUCT BYTES 0-7
          STA-Z     PRD-1           3
          STA-Z     PRD-2           3
          STA-Z     PRD-3           3
          STA-Z     PRD-4           3
          STA-Z     PRD-5           3
          STA-Z     PRD-6           3
          STA-Z     PRD-7           3
          LDX       #20             2       32-BITS IN MULTIPLIER
SHIFT:    ASL       PRD-0           5       SHIFT PRODUCT BYTES 0-7 LEFT 1 BIT
          ROL       PRD-1           5
          ROL       PRD-2           5
          ROL       PRD-3           5
          ROL       PRD-4           5
          ROL       PRD-5           5
          ROL       PRD-6           5
          ROL       PRD-7           5
          ASL       MPLR-0          5       SHIFT NEXT BIT OF MULTIPLIER INTO CARRY
          ROL       MPLR-1          5       POSITION
          ROL       MPLR-2          5
          ROL       MPLR-3          5
          BCC       NO ADD          2       NO ADDITION IF NEXT BIT IS ZERO
          CLC                       2       CARRY SET, ADD MULTIPLICAND TO PARTIAL
          LDA-Z     PRD-0           3       PRODUCT
          ADC       MPCD-0          3
          STA-Z     PRD-0           3
          LDA-Z     PRD-1           3
          ADC       MPCD-1          3
          STA-Z     PRD-1           3
          LDA-Z     PRD-2           3
          ADC       MPCD-2          3
          STA-Z     PRD-2           3
          LDA-Z     PRD-3           3
          ADC       MPCD-3          3
          STA-Z     PRD-3           3
          BCC       NO ADD          2
          LDA-Z     PRD-4           3
          ADC       #00             2
          STA-Z     PRD-4           3
          LDA-Z     PRD-5           3
          ADC       #00             2
          STA-Z     PRD-5           3
          LDA-Z     PRD-6           3
          ADC       #00             2
          STA-Z     PRD-6           3
          LDA-Z     PRD-7           3
          ADC       #00             2
          STA-Z     PRD-7           3
NO ADD:   DEX                       2
          BNE       SHIFT           2       ALL 32 BITS MULTIPLIED
END:      RTS                       6

                                    4450    TOTAL CYCLES FOR ALL BITS SET
                                    2082    TOTAL CYCLES FOR ALL BITS CLEARED
On the average, the required machine cycles might be near the
arithmetic average of the two extremes, or approximately 3266. Since the
CIC program requires 5506 machine cycles for execution, the recoding
factor for this program is approximately 0.59. Thus, two processors
executing the CIC program would yield a greater throughput than a single
processor executing the uniprocessor program.

To summarize, the 8-bit magnitude CIC program has K ≈ 0.33. The 8 x
8-bit multiplication program written in CIC has K = 0.53. The 16/8-bit
division program in CIC has K = 0.45. The 32-bit accumulator program in
CIC has K ≈ 1.0 and finally the 32 x 32-bit multiplication program has
K = 0.59. The important thing to remember is that once the price has been
paid by writing the program in CIC, one can gain throughput linearly with
additional processors. The experimental evidence here shows that the
recoding factor, K, varies from a maximum of 1.0 (32-bit accumulation) to a
minimum of approximately 0.45 (16/8-bit division). Thus, if an array of
processors were to execute these sample programs, one would see a
throughput somewhere between 0.45n and n times the throughput of a single
processor. This indicates that each PE has an efficiency of at least 45
percent for these sample programs.
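
The arithmetic behind the recoding factors quoted above can be checked
with a few lines of code. This is a minimal sketch in Python; the function
names are hypothetical, and it assumes that K is simply the ratio of
uniprocessor cycles to CIC cycles and that an array of n processors
delivers K times n the throughput of one processor.

```python
import math


def recoding_factor(uniprocessor_cycles: float, cic_cycles: float) -> float:
    """K: uniprocessor execution time divided by CIC execution time."""
    return uniprocessor_cycles / cic_cycles


def min_processors_to_win(k: float) -> int:
    """Smallest n for which the array throughput k * n exceeds 1."""
    return math.floor(1 / k) + 1


# Cycle counts quoted in the text for two of the sample programs.
k_multiply_8x8 = recoding_factor(208, 392)      # 8 x 8-bit multiplication
k_multiply_32x32 = recoding_factor(3266, 5506)  # 32 x 32-bit multiplication
```

With these definitions the 8 x 8-bit multiply needs two processors to
outrun a single uniprocessor, while a program with K = 0.45 needs three.
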
5. EXAMPLES OF DEPENDENT DATA HANDLING

5.1 Introduction

This chapter exhibits different examples of dependent data handling.
Applications of dependent data handling must have sufficient inherent
parallelism within them to allow each of the processors to process a
different segment of the entire problem. For this reason, dependent data
problems are typically larger and more complex than independent data
problems. It is quite possible that a dependent data problem will have
independent data subroutines used within it. This is because the
calculation for which the subroutine is used does not require knowledge
that the data for each of the processors is related in some way. There are
two different methods of dependent data handling. The first method employs
each of the 8-bit processors to process 8-bit data and communicate to some
of the other processors certain results of its processing. The other
method uses all of the n processors to process 8n-bit data spread across
all of the processors. This method needs a higher degree of communication
between the processors than the other method, as the processors are being
used to simulate a single more powerful processor with a word size of 8n
bits.

5.2 Carry-Propagation Problem

As suggested in the preceding section, when one teams up several
processors to simulate a larger processor, several obstacles appear. One
of the most difficult to resolve is that of a carry being propagated from
one processor to another. In multiple precision arithmetic, a carry out of
the most significant bit position is placed into what is called the carry
bit. This bit is then added to the least significant bit of the next word.
This works well when one is using a single processor. However, when one is
using several processors, the carry from the most significant bit of one
processor should be added to the least significant bit of the next
processor. One problem is that the carry-out is not available on an
external pin (except on bit-slice microprocessors) and there exists no
carry-in pin, so one cannot join general-purpose microprocessors together
in much the same manner as digital systems designers have previously
joined several adders together to form a large adder. Even if the pins did
exist, one would encounter a carry propagation problem similar to that
encountered by digital designers. The solution in that case was to use
carry look-ahead adder cells. Another solution to this carry propagation
problem is discussed in the following section.

5.3 Stored-Carry Solution

The preceding section described the carry propagation problem and
examined a possible hardware solution to the problem other than redesign
of the microprocessor. One solution is to transfer the carry from each of
the processors to the SM and then transfer the proper carry to the next
processor. This solution is inefficient because every addition requires at
least one WRITE to SM and one READ from SM.

The original solution of transferring each carry through SM is modified
so that it is necessary to transfer one word containing several carries to
the next processor only after 255 additions. This 'Stored Carry' method
assumes that a large number of values are to be added. Each processor
performs double precision arithmetic in adding 255 values together. The
upper byte then contains the carries from the last 255 additions (a
maximum of 255). One then transfers the stored carry word to the next
processor through SM and adds it to that processor's accumulated value.
This method requires only one WRITE to and READ from SM for 255 additions
and is therefore much more efficient than the other method.
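
The Stored Carry method described above can be sketched in a few lines.
The following Python model is an illustration only, not the 6502 program
itself: the function name is hypothetical, the four byte lanes stand in
for the four processors, and the final exchange is modelled as a simple
ripple from each lane to the next rather than as transfers through SM.

```python
def stored_carry_sum(values, lanes=4):
    """Sum up to 255 multi-byte values, one byte lane per simulated processor."""
    assert len(values) <= 255, "the stored carry byte holds at most 255 carries"
    sums = [0] * lanes     # low byte of each lane's 16-bit running total
    carries = [0] * lanes  # high byte: carries stored locally in each lane

    # Accumulation phase: no lane needs anything from any other lane.
    for value in values:
        for lane in range(lanes):
            total = sums[lane] + ((value >> (8 * lane)) & 0xFF)
            sums[lane] = total & 0xFF
            carries[lane] += total >> 8

    # Exchange phase: each stored carry word is passed to the next lane once.
    result = 0
    incoming = 0
    for lane in range(lanes):
        total = sums[lane] + incoming
        result |= (total & 0xFF) << (8 * lane)
        incoming = carries[lane] + (total >> 8)
    # The last lane's carry word becomes the extra most significant byte.
    return result | (incoming << (8 * lanes))
```

The point of the design is visible in the two phases: interprocessor
traffic occurs once per 255 additions instead of once per addition.
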
5.4 32-Bit Accumulation

The following program calculates the sum of 255 thirty-two-bit words.
The program assumes that six locations in page zero have been previously
initialized to the following values.

LOCATION  PROCESSOR 0  PROCESSOR 1  PROCESSOR 2  PROCESSOR 3
FA        00           01           02           03
FB        00           00           00           00
FC        00           00           00           00
FD        B0           B1           B2           B3
FE        00           00           00           00
FF        C7           C1           C2           C3



32-BIT ACCUMULATION/CIC (DEPENDENT DATA)

LABEL     MNEMONIC  OPERAND         CYCLES  COMMENT
BEGIN:    LDA       #00             2
          STA-Z     SUM             3       CLEAR SUM LOCATION
          STA-Z     CARRY           3       CLEAR STORED CARRY BYTE
          LDX       #FF             2
LOOP:     DEX                       2
          LDA       BASE,X          4       GET NEXT VALUE TO BE ADDED TO SUM
          CLC                       2
          ADC       SUM             3
          STA-Z     SUM             3
          LDA       #00             2
          ADC       CARRY           3
          STA-Z     CARRY           3
          CPX       #00             2       ALL 255 VALUES SUMMED?
          BNE       LOOP            2       IF NOT, LOOP
          LDA-Z     CARRY           3       LOAD ALL STORED CARRY WORDS
          STA       (FC),X          6       STORE CARRY WORD FROM CPU-0
          INX                       2
          STA       $C100           4       SET CP=1
          STA       (FC),X          6       STORE CARRY WORD FROM CPU-1
          INX                       2
          STA       $C200           4       SET CP=2
          STA       (FC),X          6       STORE CARRY WORD FROM CPU-2
          LDA       $B000           4       GET CARRY WORD FROM CPU-0
          STA-Z     $0001           3       STORE CARRY FOR CPU-1
          LDA       $B100           4       GET CARRY WORD FROM CPU-1
          STA-Z     $0002           3       STORE CARRY FOR CPU-2
          LDA       $B200           4       GET CARRY WORD FROM CPU-2
          STA-Z     $0003           3       STORE CARRY FOR CPU-3
          STA       $C700           4       SET CP=0
          LDX       #00             2
          STX       $0000           3       CLEAR CARRY WORD FOR CPU-0
          LDA       (FA),X          6       LOAD ALL CARRIES SIMULTANEOUSLY
          CLC                       2
          ADC       SUM             3       ADD CARRIES TO SUM
          STA-Z     SUM             3       STORE RESULT
          STA       (FC),X          6       TRANSFER SUM BYTE FROM CPU-0
          STA       $C100           4       CP=1
          STA       (FC),X          6       TRANSFER SUM BYTE FROM CPU-1
          STA       $C200           4       CP=2
          STA       (FC),X          6       TRANSFER SUM BYTE FROM CPU-2
          STA       $C300           4       CP=3
          STA       (FC),X          6       TRANSFER SUM BYTE FROM CPU-3
          LDA-Z     CARRY           3
          ADC       #00             2       ADD POSSIBLE CARRY TO CPU-3's CARRY WORD
          INX                       5
          STA       (FC),X          6       STORE MSB OF ACCUMULATION

                                    7034    TOTAL MACHINE CYCLES REQUIRED


The preceding program performs 16-bit addition in accumulating 255
eight-bit words for each processor. Each processor then has a sum byte and
a carry byte. The carry bytes are then transferred through the SM to the
next processor. The carry word from the previous processor is then added
to the sum byte of the current processor. Finally, each of the sum bytes
is transferred to the SM, with CPU-3 transferring both the sum byte and
the carry byte in order to complete the 5 bytes necessary to accumulate
255 thirty-two-bit numbers. One notes that it is necessary to perform four
WRITEs to and READs from SM in order to transfer the carry word from each
processor to the next processor. This is due to the architecture chosen,
which allows interprocessor communication only through SM. The process of
transferring carry words to each of the processors and transferring the
results to SM takes 139 machine cycles. If the architecture permitted the
transfers to be done in one pass instead of four, the transfers would only
have taken 35 machine cycles. This represents a savings of 104 of a total
of 7034 machine cycles required for the accumulation, only 1.5 percent.
However, one should not lose sight of the potential problem when the number
of processors becomes large. If one does not want the amount of time
required for transferring data between processors to take more than 10
percent of the entire program, one cannot use this architecture for more
than approximately twenty processors. That is because the transfer
requires 35n machine cycles and the entire program takes approximately
7000 machine cycles. Thus, for n greater than twenty, the transfer alone
will require more than 10 percent of the array's time. However, this
result assumes that n-byte arithmetic is used and it is unlikely that one
would ever require 20-byte precision.
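
The twenty-processor limit follows directly from the two figures quoted
above. The short Python sketch below reproduces the calculation; the
function names and default arguments are hypothetical, and it assumes the
transfer cost grows as 35 machine cycles per processor while the program
itself stays at roughly 7000 machine cycles.

```python
def transfer_percent(n, cycles_per_processor=35, program_cycles=7000):
    """Share of the program, in percent, spent transferring data for n PEs."""
    return 100 * cycles_per_processor * n / program_cycles


def largest_array(limit_percent=10, cycles_per_processor=35, program_cycles=7000):
    """Largest n whose transfer time stays within limit_percent of the program."""
    n = 0
    # Integer comparison avoids any floating-point rounding at the boundary.
    while cycles_per_processor * (n + 1) * 100 <= limit_percent * program_cycles:
        n += 1
    return n
```

For the four-processor array this gives about 2 percent, which agrees with
the 139 machine cycles measured for the program above.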

5.5 32 x 32-Bit Multiplication

This program multiplies one 32-bit number by another 32-bit number to
obtain a 64-bit product. The multiplicand is assumed to be located in four
bytes called D0, D1, D2 and D3 situated in page zero for all of the
processors. The four bytes of the multiplier are distributed among the
processors. That is, CPU-0 has R0, CPU-1 has R1, CPU-2 has R2 and CPU-3
has R3. When the program refers to R, it is referring to the respective
byte of the multiplier which each processor has. The program has each
processor multiply its multiplier byte by the low byte of the multiplicand.
This product is placed into two locations called S0 and S1. The multiplier
byte is then multiplied by the second byte of the multiplicand. The low
byte of this 16-bit product is added to S1 and the high byte is placed in
S2. The third byte of the multiplicand is multiplied by the multiplier
byte. The low byte of this product is added to S2 and the high byte is
placed in S3. The fourth byte of the multiplicand is multiplied by the
multiplier byte. The low byte of this product is added to S3 and the high
byte of the product is placed in S4. Each processor then transfers its
S0-S4 words to SM. The partial sums S0-S4 are stored in a shifted manner
to indicate their weightings (Figures 5.1a and 5.1b). All the partial sums
are read into each processor's page zero. Then the X index register of
each processor is initialized to a different value so that each processor
indexes to different partial sums for accumulation. For example, CPU-0
adds its S0 to zero and stores the result in Lo-Accum. It then adds its S1
to CPU-1's S0 with carry and stores this in Mid-Accum. CPU-0 adds zero to
Lo-Accum and to Mid-Accum twice more to get the final result of Lo-Accum
and Mid-Accum. In order to account for possible carries from one processor
to another, zero is added with carry to a null location called Hi-Accum.
Of course, while CPU-0 is doing this, the other processors are accumulating
their own Lo-Accum, Mid-Accum and Hi-Accum from their own indexed data.
The only thing left to do is to transfer the stored carry Hi-Accum byte to
the next processor and add it in to obtain the final 64-bit product. This
program assumes the same initialization as the 32-bit accumulator program.
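
The partial-sum scheme described above can be modelled compactly. The
Python sketch below is an illustration under simplifying assumptions, not
a transcription of the 6502 program: the function name is hypothetical,
the four processing elements are simulated one after another, and the
final accumulation is done column by column with a running carry instead
of with the Lo-Accum, Mid-Accum and Hi-Accum bytes of each processor.

```python
def multiply_32x32_by_partial_sums(multiplicand: int, multiplier: int) -> int:
    """64-bit product built from four shifted 5-byte partial sums."""
    d = [(multiplicand >> (8 * i)) & 0xFF for i in range(4)]  # D0..D3
    r = [(multiplier >> (8 * i)) & 0xFF for i in range(4)]    # R0..R3

    rows = [[0] * 8 for _ in range(4)]  # one 8-byte row per PE, nulls are zero
    for pe in range(4):
        s = [0] * 5                     # S0..S4 for this PE
        carry = 0
        for i in range(4):
            product = r[pe] * d[i]      # 8 x 8-bit multiply subroutine
            total = s[i] + (product & 0xFF) + carry
            s[i] = total & 0xFF         # low byte added to S(i)
            carry = total >> 8
            s[i + 1] = product >> 8     # high byte placed in S(i+1)
        s[4] += carry                   # last possible carry goes into S4
        for j in range(5):
            rows[pe][pe + j] = s[j]     # store shifted by the PE's weighting

    # Accumulate the aligned bytes of all four rows, propagating carries.
    result = 0
    carry = 0
    for column in range(8):
        total = sum(rows[pe][column] for pe in range(4)) + carry
        result |= (total & 0xFF) << (8 * column)
        carry = total >> 8
    return result
```

Each row holds five data bytes and three null bytes, which is why the
program below begins by clearing the null locations.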

32 x 32-BIT MULTIPLY/CIC (DEPENDENT DATA)

LABEL     MNEMONIC  OPERAND         CYCLES  COMMENT
BEGIN:    LDX       #07             2
          LDA       #ZERO           2
          STA-Z     E8              3       INITIALIZE BASE POINTER ZERO
          LDA       #ONE            2
          STA-Z     EA              3       INITIALIZE BASE POINTER ONE
          LDA       #TWO            2
          STA-Z     EC              3       INITIALIZE BASE POINTER TWO
          LDA       #THREE          2
          STA-Z     EE              3       INITIALIZE BASE POINTER THREE
          LDA       #00             2
          STA       (E8),X          6       CLEAR NULL LOCATIONS
          STA       (EA),X          6
          STA       (EC),X          6
          DEX                       2
          STA       (E8),X          6
          STA       (EA),X          6
          DEX                       2
          STA       (E8),X          6
          LDX       #02             2
          STA       (EE),X          6
          DEX                       2
[Figure: multiplicand bytes D3-D0 and multiplier bytes R3-R0; each of PE 0
through PE 3 forms the products R x D0 through R x D3 and combines them
into its partial sums S4-S0.]

Figure 5.1a 32 x 32-bit multiplication diagram.
LABEL     MNEMONIC  OPERAND         CYCLES  COMMENT
          STA       (EE),X          6
          STA       (EC),X          6
          DEX                       2
          STA       (EE),X          6
          STA       (EC),X          6
          STA       (EA),X          6
          LDA-Z     R               3       GET MULTIPLIER BYTE
          STA-Z     MPLR            3       PASS PARAM. TO 8-BIT MULTIPLY SUBROUTINE
          LDA-Z     D0              3       GET LOW BYTE OF MULTIPLICAND
          STA-Z     MPCD            3       PASS PARAM. TO 8-BIT MULTIPLY SUBROUTINE
          JSR       8-BIT MULTIPLY  398     DO 8-BIT MULTIPLY OF R AND D0
          LDA-Z     PRD-L           3
          STA-Z     S0              3       PLACE LOW BYTE OF PRODUCT INTO S0
          LDA-Z     PRD-H           3
          STA-Z     S1              3       PLACE HIGH BYTE OF PRODUCT INTO S1
          LDA-Z     D1              3       GET 2ND BYTE OF MULTIPLICAND
          STA-Z     MPCD            3       PASS PARAM. TO 8-BIT MULTIPLY SUBROUTINE
          JSR       8-BIT MULTIPLY  398     DO 8-BIT MULTIPLY OF R AND D1
          CLC                       2
          LDA-Z     PRD-L           3
          ADC       S1              3       ADD LOW BYTE OF R X D1 TO S1
          STA-Z     S1              3
          LDA       #00             2
          ROR       ACC.            2
          STA-Z     CARRY           3       STORE POSSIBLE CARRY
          LDA-Z     PRD-H           3
          STA-Z     S2              3
          LDA-Z     D2              3       GET 3RD BYTE OF MULTIPLICAND
          STA-Z     MPCD            3       PASS PARAM. TO 8-BIT MULTIPLY SUBROUTINE
          JSR       8-BIT MULTIPLY  398     DO 8-BIT MULTIPLY OF R AND D2
          LDA-Z     CARRY           3
          ASL       ACC.            2       PLACE POSSIBLE CARRY BACK
          LDA-Z     PRD-L           3       GET LOW BYTE OF PRODUCT R X D2
          ADC       S2              3
          STA-Z     S2              3       ADD LOW BYTE OF R X D2 WITH CARRY TO S2
          LDA       #00             2
          ROR       ACC.            2
          STA-Z     CARRY           3       STORE POSSIBLE CARRY
          LDA-Z     PRD-H           3
          STA-Z     S3              3       STORE HIGH BYTE OF R X D2 IN S3
          LDA-Z     D3              3       GET 4TH BYTE OF MULTIPLICAND
          STA-Z     MPCD            3       PASS PARAM. TO 8-BIT MULTIPLY SUBROUTINE
          JSR       8-BIT MULTIPLY  398     DO 8-BIT MULTIPLY OF R AND D3
          LDA-Z     CARRY           3
          ASL       ACC.            2       RETURN CARRY FROM LAST ADD
          LDA-Z     PRD-L           3
          ADC       S3              3       ADD LOW BYTE OF R X D3 TO S3
          STA-Z     S3              3       STORE FINAL S3
          LDA-Z     PRD-H           3
          ADC       #00             2       ADD IN POSSIBLE CARRY
          STA-Z     S4              3       STORE FINAL S4
          LDX       #04             2
          STA       (FC),X          6       TRANSFER S4 FROM CPU-0
          DEX                       2
          LDA-Z     S3              3
          STA       (FC),X          6       TRANSFER S3 FROM CPU-0
          DEX                       2
          LDA-Z     S2              3
          STA       (FC),X          6       TRANSFER S2 FROM CPU-0
          DEX                       2
          LDA-Z     S1              3
          STA       (FC),X          6       TRANSFER S1 FROM CPU-0
          DEX                       2
          LDA-Z     S0              3
          STA       (FC),X          6       TRANSFER S0 FROM CPU-0
          STA       $C100           4       SET CP=1
          LDX       #05             2
          LDA-Z     S4              3
          STA       (FC),X          6       TRANSFER S4 FROM CPU-1
          DEX                       2
          LDA-Z     S3              3
          STA       (FC),X          6       TRANSFER S3 FROM CPU-1
          DEX                       2
          LDA-Z     S2              3
          STA       (FC),X          6       TRANSFER S2 FROM CPU-1
          DEX                       2
          LDA-Z     S1              3
          STA       (FC),X          6       TRANSFER S1 FROM CPU-1
          DEX                       2
          LDA-Z     S0              3
          STA       (FC),X          6       TRANSFER S0 FROM CPU-1
          STA       $C200           4       SET CP=2
          LDX       #06             2
          LDA-Z     S4              3
          STA       (FC),X          6       TRANSFER S4 FROM CPU-2
          DEX                       2
          LDA-Z     S3              3
          STA       (FC),X          6       TRANSFER S3 FROM CPU-2
          DEX                       2
          LDA-Z     S2              3
          STA       (FC),X          6       TRANSFER S2 FROM CPU-2
          DEX                       2
          LDA-Z     S1              3
          STA       (FC),X          6       TRANSFER S1 FROM CPU-2
          DEX                       2
          LDA-Z     S0              3
          STA       (FC),X          6       TRANSFER S0 FROM CPU-2
          STA       $C300           4       SET CP=3
          LDX       #07             2
          LDA-Z     S4              3
          STA       (FC),X          6       TRANSFER S4 FROM CPU-3
          DEX                       2
          LDA-Z     S3              3
          STA       (FC),X          6       TRANSFER S3 FROM CPU-3
          DEX                       2
          LDA-Z     S2              3
          STA       (FC),X          6       TRANSFER S2 FROM CPU-3
          DEX                       2
          LDA-Z     S1              3
          STA       (FC),X          6       TRANSFER S1 FROM CPU-3
          DEX                       2
          LDA-Z     S0              3
          STA       (FC),X          6       TRANSFER S0 FROM CPU-3
          STA       $C700           4       SET CP=0
          LDX       #05             2
QLOOP:    DEX                       2
          LDA       (FC),X          6       GET S0-S4 FROM CPU-0 FOR ALL CPUs
          STA       (E8),X          6
          CPX       #00             2
          BNE       QLOOP           2
          STA       $C100           4       SET CP=1
          LDX       #06             2
RLOOP:    DEX                       2
          LDA       (FC),X          6       TRANSFER S0-S4 FROM CPU-1 TO ALL CPUs
          STA       (EA),X          6
          CPX       #01             2
          BNE       RLOOP           2
          STA       $C200           4       SET CP=2
          LDX       #07             2
TLOOP:    DEX                       2
          LDA       (FC),X          6       TRANSFER S0-S4 FROM CPU-2 TO ALL CPUs
          STA       (EC),X          6
          CPX       #02             2
          BNE       TLOOP           2
          STA       $C300           4       SET CP=3
          LDX       #08             2
ULOOP:    DEX                       2
          LDA       (FC),X          6       TRANSFER S0-S4 FROM CPU-3 TO ALL CPUs
          STA       (EE),X          6
          CPX       #03             2
          BNE       ULOOP           2
          LDA-Z     $FA             3       GET INDEX REG X=0 FOR CPU-0, 2 FOR CPU-1
          ASL       ACC.            2       4 FOR CPU-2, 6 FOR CPU-3
          TAX                       2
          LDA       (E8),X          6       ACCUMULATE 3 BYTES FOR EACH CPU, THAT IS,
          CLC                       2       LO-ACCUM, MID-ACCUM AND HI-ACCUM
          ADC       (EA),X          6       HI-ACCUM = STORED CARRIES
          STA-Z     LO-ACCUM        3
          INX                       2
          LDA       (E8),X          6
          ADC       (EA),X          6
          STA-Z     MID-ACCUM       3
          LDA       #00             2
          ADC       #00             2
          STA-Z     HI-ACCUM        3
          CLC                       2
          LDA-Z     LO-ACCUM        3
          DEX                       2
          ADC       (EC),X          6
          STA-Z     LO-ACCUM        3
          INX                       2
          LDA-Z     MID-ACCUM       3
          ADC       (EC),X          4
          STA-Z     MID-ACCUM       3
          LDA-Z     HI-ACCUM        3
          ADC       #00             2
          STA-Z     HI-ACCUM        3
          CLC                       2
          LDA-Z     LO-ACCUM        3
          DEX                       2
          ADC       (EE),X          6
          STA-Z     LO-ACCUM        3       FINISH ACCUMULATION OF 3 BYTE SUMS
          INX                       2
          LDA-Z     MID-ACCUM       3
          ADC       (EE),X          6
          STA-Z     MID-ACCUM       3
          LDA-Z     HI-ACCUM        3
          ADC       #00             2
          STA-Z     HI-ACCUM        3
          DEX                       2
          LDY       #00             2
          STA       (FC),Y          6       TRANSFER STORED CARRIES FROM CPU-0
          STA       $C100           4       SET CP=1
          STA       (FC),Y          6       TRANSFER STORED CARRIES FROM CPU-1
          STA       $C200           4       SET CP=2
          STA       (FC),Y          6       TRANSFER STORED CARRIES FROM CPU-2
          STA       $C700           4       SET CP=0
          LDA       #00             2
          STA       (E8),Y          6       CLEAR STORED CARRY WORD FOR CPU-0
          LDA       (FC),Y          6       GET STORED CARRY WORD FOR CPU-1
          LDY       #02             2
          STA       (E8),Y          6       GIVE S.C. WORD FROM CPU-0 TO ALL CPUs
          STA       $C100           4       SET CP=1
          LDY       #00             2
          LDA       (FC),Y          6       GET STORED CARRY WORD FROM CPU-1
          LDY       #04             2
          STA       (E8),Y          6       GIVE S.C. WORD FROM CPU-1 TO ALL CPUs
          STA       $C200           4       SET CP=2
          LDY       #00             2
          LDA       (FC),Y          6       GET S.C. WORD FROM CPU-2
          LDY       #06             2
          STA       (E8),Y          6       GIVE S.C. WORD FROM CPU-2 TO ALL CPUs
          CLC                       2
          LDA-Z     LO-ACCUM        3
          ADC       (E8),X          6       ADD IN STORED CARRIES
          STA-Z     LO-ACCUM        3
          LDA-Z     MID-ACCUM       3
          ADC       #00             2
          STA       MID-ACCUM       3

                                    2685    TOTAL MACHINE CYCLES REQUIRED


From the above program, one can easily see how various independent
data programs may be placed within a larger and more complex dependent data
program. One example is the 8-bit multiply independent data subroutine
which is used in the 32 x 32-bit multiply dependent data program above.

5.6 Comparison of CIC Programs With Uniprocessor Programs

The 32-bit accumulation (dependent data) program requires all four
processors, but is able to execute an accumulation of 255 thirty-two-bit
numbers in 7034 machine cycles, whereas the uniprocessor program requires
16,857 machine cycles to do the same job. Thus, the four-processor array
can perform the 32-bit accumulation 2.4 times as fast as the single
processor. This represents a recoding factor of 0.60. These results would
certainly encourage one to pursue array processing. An important fact to
point out is that with the architecture used in this study, propagating
carries from one processor to the next for 8n-bit precision arithmetic
will take a larger and larger amount of time as the number of processors
grows. However, if the architecture were modified to allow all the
processors to propagate their carries in one pass instead of n passes, this
would not be the case. For this array of 4 processors, propagating the
carries required less than 2 percent of the execution time; therefore, the
architecture did not significantly hamper throughput. With K = 0.60, each
of the microprocessors is about 60 percent efficient. So long as
carry-propagation delay is not significant, one could expect a throughput
of 0.6n for an array of n processors executing this program.

The 32 x 32-bit multiply (dependent data) program requires all four
processors but is able to perform a 32 x 32-bit multiply in 2685 machine
cycles as compared to the average execution time of 3266 for the
uniprocessor program. Thus, one is able to obtain a speed-up of 1.22 by
using four processors. To determine the recoding factor for this program,
one sets the throughput of the array equal to Kn; with a speed-up of 1.22
and n = 4, the recoding factor K is roughly 0.30. If one examines this
program closely, it is apparent that almost 20 percent of the time is spent
in transferring values from one processor through SM to another processor.
Thus, the architecture used to implement the four-processor array is
seriously hampering the efficiency of this program by requiring four times
as long to pass parameters between processors. If one were to implement
another architecture which would allow all the parameters to be passed in
a single sweep, the recoding factor K would reach approximately 0.40. In
this case, an array of processors could be expected to exhibit a throughput
of roughly 0.4n.
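
For the dependent data programs the recoding factor is obtained from the
measured speed-up rather than from a direct cycle ratio. The Python sketch
below repeats that calculation with the cycle counts quoted above; the
function name is hypothetical, and it assumes the relation speed-up = Kn
used in this section.

```python
def dependent_recoding_factor(uniprocessor_cycles, array_cycles, n):
    """K such that an n-processor array has K * n times the throughput."""
    speed_up = uniprocessor_cycles / array_cycles
    return speed_up / n


k_accumulate = dependent_recoding_factor(16857, 7034, 4)  # 32-bit accumulation
k_multiply = dependent_recoding_factor(3266, 2685, 4)     # 32 x 32-bit multiply
```

Rounded to two places these evaluate to 0.60 and 0.30 respectively, the
values carried forward into the summary.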



6. SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH

6.1 Summary

As stated earlier, the main objective of this work was to assert that
the recoding of standard uniprocessor programs into Context Independent
Code programs is feasible for an important set of applications. This
objective was achieved by implementing a four-processor array and recoding
several programs for it. The programs were divided into two categories,
independent and dependent data handling. The first category allowed each
of the processors to work on a separate set of data such that no processor
required any results from any other processor in order to complete its
task. The second category required each of the processors to work on a
subset of the entire problem. This meant that the processors needed to
communicate intermediate results with one another at various times in order
to complete their task.

In order to allow easier comprehension of the programs, the design
of the Super-65 array processor was described in detail and alternative
approaches were analyzed. The strengths and weaknesses of the
implementation chosen for the Super-65 were noted. Among the strengths
were its simplicity, both in processor-memory interconnection and
interprocessor connections, expandability of the system, no host processor
required and the fault tolerance potential of the design. Among the
weaknesses of the design were its restriction of interprocessor
communications, the fact that only one processor is able to write to SM at
a time and the fact that all programs executed on the Super-65 had to be
written in Context Independent Code. The last weakness does not inhibit
this study at all, as the intent of this study is to examine the
implementation of Context Independent Code. The first two weaknesses of
the design did not affect the independent data programs nearly as much as
the dependent data programs. The effect of the specific architecture on
the efficiency of the CIC programs was noted in the case of the dependent
data programs and a possible alternative architecture was suggested.

The conclusion drawn from the independent data programs was that for
the set of recoded programs, if one were to have an array of n processors,
the throughput of the array would range somewhere between 0.45n and n times
the throughput of a single processor. This means that the recoding factor
of the sample programs ranged from 0.45 to 1.00. This is, of course, quite
impressive and encouraging.

The conclusion drawn from the dependent data programs was that for
the set of recoded programs, if one had an array of n processors, the
throughput of the array would range between 0.3n and 0.6n times the
throughput of a single processor. As expected, the dependent data programs
were considerably less efficient than the independent data programs, with
the recoding factor for the sample programs ranging from 30 to 60 percent
(see Table 6.1).

TABLE 6.1 Recoding factors for the sample programs

                        Independent Data   Dependent Data
8-Bit Magnitude         0.33
8 x 8-Bit Multiply      0.53
16/8-Bit Divide         0.45
32-Bit Accumulator      1.00               0.60
32 x 32-Bit Multiply    0.59               0.30

It is thus apparent that the throughput of the array processor is
highly dependent on the type of programs that it is executing. However, if
one considers that it has been shown by this study to be possible to obtain
a linear relationship between throughput and the number of processors in
the array (so long as carry-propagation delay is minimal), one must admit
that Context Independent Code may provide the key to arrays of immense
proportions. One may conclude that this study has shown that the
implementation of Context Independent Code is not only feasible for array
programs, but is in fact desirable as it allows the array throughput to be
linearly related to the array size. Limitations to the array size are not
due to the CIC program but rather are due to the hardware restraints that
one chooses to impose. Of course, if the array were to be infinitely large,
the time delay from one end of the array to the other could become
significant. Context Independent Code further has the property that the
processors never become unsynchronized once they are initialized because
all the processors are always forced to execute the exact same instruction.
That is, none of the processors is allowed to be turned off during the
execution of a CIC program.

6.2 The Ideal Microprocessor for an Array of Microprocessors

No attempt is made here to define the ideal microprocessor for an
array of microprocessors. Instead, several desirable qualities that are
found lacking in the microprocessor used for the Super-65 are described,
and those features of the 6502 microprocessor which are extremely useful
are noted as well.

The most vital feature of the 6502 is its indirect indexed addressing
mode. This feature allows the processors to execute the same instruction
but locally index the effective address so that the processors actually
access different memory locations at the same time. This property is
essential as it allows one to use pointers to point to the desired data
locations. Also, it allows one to index through data from a base location.
Since a READ from SM is always executed by all of the processors, values
read from SM are stored in the same locations in all processors. One way
to allow different processors to obtain different data while executing the
same instruction is to use indirect indexed addressing where either the
indirect value or the index value is a local value.
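
How one broadcast instruction can reach different data in each processor
may be illustrated with a small model. The following Python sketch is an
assumption-laden illustration: the function names are hypothetical, each
processor's page zero is modelled as a separate 256-byte array, and the
pointer at locations FC and FD is given the local values B0 through B3
used for the sample programs.

```python
def broadcast_load(zero_pages, shared_memory, pointer_address, index):
    """Every PE executes the same load; only the pointer value is local."""
    loaded = []
    for zero_page in zero_pages:
        # Two-byte pointer, low byte first, read from this PE's own page zero.
        base = zero_page[pointer_address] | (zero_page[pointer_address + 1] << 8)
        loaded.append(shared_memory[(base + index) & 0xFFFF])
    return loaded


def example_load():
    """Four PEs whose FC/FD pointers select pages B0, B1, B2 and B3."""
    shared_memory = bytearray(0x10000)
    zero_pages = []
    for pe in range(4):
        zero_page = bytearray(256)
        zero_page[0xFC] = 0x00
        zero_page[0xFD] = 0xB0 + pe
        zero_pages.append(zero_page)
        for offset in range(4):
            shared_memory[((0xB0 + pe) << 8) + offset] = 16 * pe + offset
    return broadcast_load(zero_pages, shared_memory, 0xFC, 2)
```

The same instruction with the same index thus returns a different byte in
every processor, purely because the pointer stored in page zero differs.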

Another important mode of addressing is Indexed Indirect Addressing,
where one can index through a table of pointers for different data. This
mode is not quite as useful as the previous mode but still provides the
programmer a much more versatile set of instructions.

One of the most distinctive features of the 6502 is its Zero-Page 
Addressing mode. This mode allows the programmer to access any of the 256 
locations of page zero very rapidly and thus allows one to use page zero in 
the same manner as a small cache memory. This addressing mode allows for a 
considerable increase in throughput of the 6502 if used efficiently. 
However, for many applications, 256 locations are insufficient to contain 
all the necessary data, and for these cases Zero-Page Addressing is not as 
attractive as it could be. A modified Zero-Page Addressing mode may be much 
more useful for larger programs. This modified Zero-Page Addressing mode 
can be called Designated Page Addressing. This mode requires an 8-bit page 
register that can be set to any of the 256 different pages in the 6502 
memory. In this way, one can designate which page of memory is to have fast 
access. This allows the microprocessor to execute at almost twice its 
regular speed, as it would seldom be necessary to specify the high byte of 
each address. One executes a 'Set Page' instruction at the beginning of the 
program and then executes the bulk of the instructions from that page in 
memory. If it becomes necessary to cross into the next page or some other 
page of memory for a considerable number of instructions or data, one 
simply sets the page to a different number. Another benefit is that for 
stack-oriented code, the designated page may be set to that page of memory 
where the stack resides. This could allow one to access the stack very 
quickly for non-stack operations. Altogether, this designated page option 
is strongly recommended. 
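
A minimal Python sketch of the proposed mode follows. The class and method 
names are hypothetical, and the model shows only how the effective address 
would be formed from the page register and a one-byte operand; it says 
nothing about timing. With the page register left at zero the model behaves 
like ordinary Zero-Page Addressing. 

```python
class DesignatedPageModel:
    """Toy model of the proposed page register (not a feature of the real 6502)."""

    def __init__(self):
        self.page = 0x00                  # page 0 gives ordinary zero-page behaviour
        self.memory = bytearray(0x10000)  # 64K address space

    def set_page(self, page):
        """Model of the proposed 'Set Page' instruction."""
        self.page = page & 0xFF

    def effective_address(self, low_byte):
        """Join the 8-bit page register with the one-byte operand."""
        return (self.page << 8) | (low_byte & 0xFF)

    def store(self, low_byte, value):
        self.memory[self.effective_address(low_byte)] = value & 0xFF

    def load(self, low_byte):
        return self.memory[self.effective_address(low_byte)]


cpu = DesignatedPageModel()
cpu.store(0x62, 0x11)    # page register is 0, so this writes $0062
cpu.set_page(0x20)       # designate page $20 for fast access
cpu.store(0x62, 0x7F)    # the same one-byte operand now reaches $2062
print(hex(cpu.effective_address(0x62)))
```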

One characteristic of the 6502 is that one cannot do memory-to-memory 
manipulations. That is, one must always route one of the operands by way 
of the accumulator. This does not allow one to keep any temporary result 
in the accumulator and also forces the programmer to use more instructions 
to perform a memory-to-memory transfer. For this reason, another 
accumulator may be desirable, particularly one which has the full 
capabilities of the original accumulator. This accumulator might be 
transparent to the programmer such that the microprocessor is capable of 
memory-to-memory manipulation without passing through an accumulator. 

Placing a microprocessor into an array system, especially the Super-65, 
means extensive use of the index registers. More such registers, preferably 
with general-purpose register capabilities of shifting, incrementing, 
decrementing and the like, could be used effectively. The addition of at 
least one general-purpose register with the option of adding the contents 
of that register to the accumulator may resolve the temporary storage 
problem. 

In contrast to the MC6800, the 6502 does not have tristate capability 
on the chip. That is, the 6502 does not itself provide DMA capability. 
However, the architecture of the Super-65 could not have taken advantage of 
this capability had it been available. This is because each processor 
should always be able to access its private memory. This would not be the 
case if the tristate buffers for the address and data buses were on the 
microprocessor chip. If the microprocessor had 512 bytes of RAM on-board 
and tristate buffers on-board with control inputs to determine when the 
buffers should be tristated, one could reduce chip count on the processor 
board significantly. This reduction might not be worth the required 
additional complexity of the microprocessor chip. However, with the onset 
of VLSI, the above option might be easily within reach. 

Most microprocessors have the capability of being halted for varying 
amounts of time. This is typically done by either a HALT signal or by 
disabling the clock input to the microprocessor. When the clock is disabled, 
the most reliable procedure is to reset the microprocessors before 
proceeding. It is desirable to temporarily cause the microprocessor to 
execute no-ops with the clock active so that the microprocessor remains 
synchronized with the other processors of the array. Thus, the capability to 
disable instruction decoding within the microprocessor and force execution 
of no-ops until the instruction decode disable control input goes inactive 
would be quite useful in deselecting certain processors for a few 
instructions. 

One final property that present microprocessors do not have is the 
ability to team the processors easily to do multi-word arithmetic as a 
single unit. In particular, there is no method of propagating carries from 
one processor to the next without loading the accumulator with zero and 
shifting the carry bit into the accumulator, then storing the accumulator 
where the next processor can read it. This fact led to implementation of 
the stored-carry solution. The stored-carry solution works reasonably well 
when several additions are necessary. However, when only one addition is 
required, the stored-carry approach is extremely inefficient and 
unsatisfactory. One solution would be to place carry-in and carry-out pins 
on each microprocessor. This solution would lead to lengthy carry-propagation 
delays which would be unacceptable. A possible alternative would be to 
place carry-propagate and carry-generate inputs and outputs on each 
processor. This method would require two additional pins but would allow 
the carry propagation delay time to be substantially smaller than the 
preceding solution. 
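
The carry-generate and carry-propagate alternative can be sketched in 
Python as follows. The sketch assumes that each PE holds one byte of a 
multi-byte operand, least significant byte in the first PE, and that some 
external lookahead logic combines the signals; the function names are 
illustrative only. The loop in lookahead_carries stands in for 
combinational logic that would produce all carries at once in hardware. 

```python
def pe_generate_propagate(a, b):
    """Signals one PE would output for its own byte-wide addition."""
    total = a + b
    generate = total > 0xFF     # carry out even with no carry in
    propagate = total == 0xFF   # carry out only if a carry comes in
    return generate, propagate


def lookahead_carries(signals, carry_in=False):
    """Derive the carry into each PE from the generate/propagate signals."""
    carries = []
    carry = carry_in
    for generate, propagate in signals:
        carries.append(carry)
        carry = generate or (propagate and carry)
    return carries, carry


def array_add(a_bytes, b_bytes):
    """Add two multi-byte numbers held one byte per PE (low byte first)."""
    signals = [pe_generate_propagate(a, b) for a, b in zip(a_bytes, b_bytes)]
    carries, carry_out = lookahead_carries(signals)
    sums = [(a + b + int(c)) & 0xFF
            for a, b, c in zip(a_bytes, b_bytes, carries)]
    return sums, int(carry_out)


# $00FFFFFF + $00000001 spread across four PEs.
result, carry_out = array_add([0xFF, 0xFF, 0xFF, 0x00], [0x01, 0x00, 0x00, 0x00])
print(result, carry_out)
```

In this example the carry generated in the first PE is passed through the 
two PEs that only propagate, without any PE having to wait for its 
neighbor's addition to finish. 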
6.3 Extending the Microprocessor Array 

There are several obstacles to extending the microprocessor array to a 
very large number. These obstacles are due to the hardware implementation 
of the Super-65 rather than the implementation of Context Independent Code. 

The most serious impairment is that for an array of n processors, n 
WRITES to SM and n READS from SM are required to pass information from each 
processor to the next. In the four-processor array implemented, this 
problem was not conspicuous. One can readily see that for a larger array, 
the percentage of time spent simply communicating between processors could 
rapidly become unacceptable. Therefore, in order to extend the array 
substantially, one should modify the interprocessor communications to allow 
each processor to communicate at least to its nearest neighbors by 
performing a single WRITE. This can be done perhaps most easily by giving 
each processor two special locations within its private memory. Whenever the 
processor WRITES to one of these locations, it is giving information to one 
of its two nearest neighbors. Whenever the processor READS from one of the 
two special locations, it is receiving information from one of its two 
nearest neighbors. This would relieve the processor communication 
bottleneck. 
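
The difference between the two schemes can be sketched with a simple Python 
model that counts memory operations. The model assumes a ring in which each 
PE passes one value to its right-hand neighbor, with wrap-around from the 
last PE to the first; the ring arrangement and the function names are 
assumptions made for illustration. 

```python
def pass_via_shared_memory(values):
    """Pass each PE's value to the next PE through SM, as in the implemented array."""
    n = len(values)
    shared = [None] * n
    operations = 0

    # Each PE in turn WRITES its value to SM: n WRITES.
    for pe in range(n):
        shared[pe] = values[pe]
        operations += 1

    # Every READ from SM is executed by all PEs, so n READS are needed
    # to bring all n values into every private memory.
    private = [[None] * n for _ in range(n)]
    for slot in range(n):
        for pe in range(n):
            private[pe][slot] = shared[slot]
        operations += 1

    # Each PE uses a local index to pick out its left neighbor's value.
    received = [private[pe][(pe - 1) % n] for pe in range(n)]
    return received, operations


def pass_via_neighbor_locations(values):
    """Pass each PE's value through a special location wired to its neighbor."""
    n = len(values)
    mailbox = [None] * n
    # One WRITE, executed by all PEs at once, fills every neighbor location.
    for pe in range(n):
        mailbox[(pe + 1) % n] = values[pe]
    # One READ, executed by all PEs at once, collects the values.
    received = [mailbox[pe] for pe in range(n)]
    return received, 2


values = [10, 20, 30, 40]
print(pass_via_shared_memory(values))
print(pass_via_neighbor_locations(values))
```

Both schemes deliver the same data, but the operation count of the first 
grows as 2n while that of the second stays constant. 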

A related problem is that the implemented array requires that each 
processor wait its turn to store its results in SM. Once again, this 
forces the array to perform n times as many writes as the uniprocessor 
would normally do. It is true that two processors cannot WRITE onto the 
same address and data buses at the same time. A possible solution might be 
to have all the processors store their word into a special location in 
Private Memory that is part of a special piece of hardware. This hardware 
would be designed to accept the address from the controlling processor 
and store each processor's word in a sequential fashion beginning at the 
address specified by the CP. This hardware would, of course, need to 
operate significantly faster than the processors. For very large arrays, 
it might be necessary to follow every WRITE to SM with one or two no-ops to 
allow the hardware time to complete the transfer of every processor's word. 
This then would reduce the store time to SM from n WRITE instructions to 1 
WRITE instruction and possibly 1 or 2 no-op instructions. 

The design of the processor board was meant to allow implementation 
of an arbitrarily large array. Except for one detail, this was achieved. 
The original design of the processor board includes an 8-input NAND gate 
to be attached to the reset input of the control processor flip-flop on 
each board. Obviously, this will not allow one to have more than 8 other 
processors or a total of 9 processors. In order to remedy this situation, 
open-collector buffers are placed on each of the inputs normally tied to 
the NAND gate. The wired-AND of these inputs is formed by tying them 
together to pin 39 of the peripheral connector. User 1 must be disabled on 
the Apple. Finally, the NAND of the inputs (equivalent to the previous 
design) is achieved by attaching the wired-AND to the input of an inverter. 
The output of the inverter is then tied to the reset input through a 
tri-stateable buffer whose control input is I/O Select. The buffer prevents 
I/O Select from first setting its CP flip-flop and then resetting it 
immediately afterwards. This modified design does not of itself limit the 
array size. 
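
The equivalence of the two designs can be checked with a short Python 
model. This is a logic-level sketch only; it assumes non-inverting 
open-collector buffers and a pull-up on the shared line, and the function 
names are hypothetical. 

```python
from itertools import product


def nand_gate(inputs):
    """Original design: a single NAND gate over all inputs."""
    return not all(inputs)


def wired_and_with_inverter(inputs):
    """Modified design: open-collector buffers on a shared line, then an inverter."""
    line_high = True            # the pull-up holds the shared line high
    for level in inputs:
        if not level:           # a buffer whose input is low
            line_high = False   # pulls the shared line low
    return not line_high        # output of the inverter


def designs_agree(width):
    """Compare the two designs over every input combination of a given width."""
    return all(
        nand_gate(bits) == wired_and_with_inverter(bits)
        for bits in product([False, True], repeat=width)
    )


print(designs_agree(8))    # the original 8-input case
print(designs_agree(12))   # the wired line accepts more than 8 inputs
```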

One final restriction is that the Apple II backplane has space for 
only 7 processor boards, and in order to simplify the decoding, the Apple 
II devotes an entire page to each I/O Select line, each Device Select line 
and it devotes 8 pages to I/O Strobe. Since the Apple II provides only 7 
slots with I/O Select, Device Select and I/O Strobe, it is not trivially 
possible to extend the array size. However, there is a commercially 
available card cage with the desired number of slots, required power supply 
and required decoding for the size of array desired. One would probably 
devote only 1 location to each I/O Select, Device Select and I/O Strobe 
control signal, allowing the array size to reach several thousand. 

6.4 Suggestions for Further Research 

The first suggestion for further research is to correct the 
imperfections within the Super-65 design that have been previously noted. 
Specifically, one should provide: 

(1) more sophisticated interprocessor communication, 

(2) some method of storing in SM more rapidly, 

(3) necessary decoding circuitry, etc. to allow expansion of the array. 

Then one should pursue the recoding of many more programs into Context 
Independent Code. In particular, one should determine if it is possible to 
recode dependent data programs in such a way as not to spend an 
unacceptable percentage of the time transferring results from one processor 
to another. One should try to determine more fully the restrictions on CIC 
programming and, if possible, develop more well-defined rules for 
implementing it. 


Other areas for extended research include pursuing the design of the 
ideal microprocessor for an array environment. One could determine if: 

(1) 512 bytes of on-board RAM 

(2) Tri-state address and data buffers on board 

(3) Designated page option 

(4) Instruction decoding disable control input 

(5) Extra accumulator 

(6) Additional index registers 

(7) Carry propagate/generate inputs and outputs 

are all within the practical reach of today's technology, and if so, what 
sacrifices would be necessary in order to achieve all of the above options. 

Finally, one might wish to review all the previously mentioned issues 
and try to determine what impact, if any, the use of a 16-bit microprocessor 
would have upon them. 

APPENDIX 1 

IMPLEMENTATION OF THE 8-BIT MULTIPLICATION ROUTINE 

This appendix is presented in order to document the 8 x 8-bit 
multiplication program that was demonstrated on the Super-65. The 
initialization routine takes the values stored in locations $1001 and $1002 
and places them in locations $0062 and $0063 respectively, within each PEM. 
One PE is then disabled and the contents of locations $1003 and $1004 are 
transferred to locations $0062 and $0063 within each PEM still enabled. A 
second PE is then disabled and the contents of locations $1005 and $1006 
are transferred to locations $0062 and $0063 within each PEM still enabled. 
A third PE is then disabled and the contents of locations $1007 and $1008 
are transferred to locations $0062 and $0063 within the one remaining PEM. 
All PEs are then restarted and the array performs a RESET to synchronize 
the PEs. This initialization routine places the different multipliers into 
location $0062 and the different multiplicands into location $0063 of each 
PEM. If each PE had its own I/O port, it could read its own multiplier and 
multiplicand from that port and the initialization routine just described 
would not be required. 

8 X 8-BIT MULTIPLICATION INITIALIZATION ROUTINE 

LOCATION  OBJECT CODE  MNEMONIC  OPERAND    N  COMMENT 

3900      AD 01 10     LDA       $1001      4  ; GET MULTIPLIER FOR PE 
3903      85 62        STA       $62        3  ; STORE IN PEMS 
3905      AD 02 10     LDA       $1002      4  ; GET MULTIPLICAND FOR PE 
3908      85 63        STA       $63        3  ; STORE IN PEMS 
390A      8D 90 C0     STA       $C090      4  ; DISABLE PE 
390D      AD 03 10     LDA       $1003      4  ; GET MULTIPLIER FOR PE 
3910      85 62        STA       $62        3  ; STORE IN PEMS 
3912      AD 04 10     LDA       $1004      4  ; GET MULTIPLICAND FOR PE 
3915      85 63        STA       $63        3  ; STORE IN PEMS 
3917      8D B0 C0     STA       $C0B0      4  ; DISABLE PE 
391A      AD 05 10     LDA       $1005      4  ; GET MULTIPLIER FOR PE 
391D      85 62        STA       $62        3  ; STORE IN PEMS 
391F      AD 06 10     LDA       $1006      4  ; GET MULTIPLICAND FOR PE 
3922      85 63        STA       $63        3  ; STORE IN PEMS 
3924      8D D0 C0     STA       $C0D0      4  ; DISABLE PE 
3927      AD 07 10     LDA       $1007      4  ; GET MULTIPLIER FOR PE 
392A      85 62        STA       $62        3  ; STORE IN PEM 
392C      AD 08 10     LDA       $1008      4  ; GET MULTIPLICAND FOR PE 
392F      85 63        STA       $63        3  ; STORE IN PEM 
3931      8D FF CF     STA       $CFFF      4  ; RESTART ALL PEs 

                                           72  ; TOTAL MACHINE CYCLES REQUIRED 
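
The effect of the initialization routine can be traced with a small Python 
model. This is only a sketch: the PE numbering (PE 0 as the CP, with PEs 1, 
2 and 3 disabled in that order), the operand values in SM and the function 
name are assumptions made for illustration. The model relies on one rule: a 
disabled PE does not execute the stores that follow. 

```python
def initialize(shared, disable_order, num_pes=4):
    """Trace the initialization routine for an array of num_pes PEs.

    shared        -- dict mapping SM addresses $1001.. to operand bytes
    disable_order -- PEs disabled after each multiplier/multiplicand pair
    Returns one dict per PE giving the contents of $62 and $63 in its PEM.
    """
    enabled = [True] * num_pes
    pem = [dict() for _ in range(num_pes)]
    address = 0x1001

    for step in range(num_pes):
        # LDA $10xx / STA $62 and LDA $10xx / STA $63, executed by all enabled PEs.
        for target in (0x62, 0x63):
            value = shared[address]
            for pe in range(num_pes):
                if enabled[pe]:
                    pem[pe][target] = value
            address += 1
        # STA to a disable location: one more PE stops executing.
        if step < len(disable_order):
            enabled[disable_order[step]] = False

    # The final STA $CFFF restarts all PEs; PEM contents are unchanged by it.
    return pem


shared_memory = {
    0x1001: 3, 0x1002: 5,     # hypothetical operands, pair 1
    0x1003: 7, 0x1004: 9,     # pair 2
    0x1005: 11, 0x1006: 13,   # pair 3
    0x1007: 15, 0x1008: 17,   # pair 4
}
pems = initialize(shared_memory, disable_order=[1, 2, 3])
for pe, contents in enumerate(pems):
    print(pe, contents[0x62], contents[0x63])
```

Each PE ends up holding the last pair that was stored before it was 
disabled, so every PEM contains a different multiplier and multiplicand. 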


The following program assumes that locations $0062 and $0063 have 
previously been loaded with the multiplier and multiplicand respectively. 
Location $0064 is used as a temporary storage location and locations $0060 
and $0061 contain the 16-bit product (low and high bytes respectively) 
after execution of the program. 

8 X 8-BIT MULTIPLICATION/INDEPENDENT DATA 

LOCATION  OBJECT CODE  MNEMONIC  OPERAND    N  COMMENT 

4000      A9 00        LDA       #$00       2  ; LOAD IMMEDIATE ZERO 
4002      85 60        STA       $60        3  ; CLEAR PRODUCT LOW BYTE 
4004      85 61        STA       $61        3  ; CLEAR PRODUCT HIGH BYTE 
4006      A2 08        LDX       #$08       2  ; SET BIT COUNT = 8 BITS 
4008      06 60        ASL       $60        5  ; SHIFT LEFT PRODUCT LOW BYTE 
400A      26 61        ROL       $61        5  ; ROTATE LEFT PRODUCT HIGH BYTE 
400C      06 62        ASL       $62        5  ; SHIFT LEFT MULTIPLIER 
400E      A9 00        LDA       #$00       2  ; SUBTRACT CARRY BIT FROM ZERO TO 
4010      E9 00        SBC       #$00       2  ; OBTAIN EITHER 00 (C=1) OR 
                                               ; FF (C=0) 
4012      49 FF        EOR       #$FF       2  ; COMPLEMENT PREVIOUS RESULT 
4014      25 63        AND       $63        3  ; AND EITHER 00 (C=0) OR FF (C=1) 
                                               ; WITH MULTIPLICAND 
4016      85 64        STA       $64        3  ; TEMP = EITHER 00 OR MULTIPLICAND 
4018      18           CLC                  2  ; ADD EITHER ZERO OR MULTIPLICAND TO 
4019      65 60        ADC       $60        3  ; SHIFTED PARTIAL PRODUCT LOW BYTE 
401B      85 60        STA       $60        3 
401D      A5 61        LDA       $61        3  ; ADD POSSIBLE CARRY TO PRODUCT HIGH 
401F      69 00        ADC       #$00       2  ; BYTE 
4021      85 61        STA       $61        3 
4023      CA           DEX                  2  ; DECREMENT BIT COUNT 
4024      D0 E2        BNE       $4008      2  ; DONE? IF NOT, LOOP 
4026      60           RTS                  6 

                                          392  ; TOTAL MACHINE CYCLES REQUIRED 
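
The arithmetic performed by this listing can be followed in a Python model 
of the same steps. The model is a sketch written for illustration, with the 
variable names standing in for the zero-page locations; it is not a 6502 
emulator. Its point is that the multiplier bit is turned into a mask of 00 
or FF, so no branch depends on the data and the only branch is the fixed 
eight-pass loop. 

```python
def multiply_8x8(multiplier, multiplicand):
    """Branch-free shift-and-add multiply following the listing step by step.

    Both arguments are assumed to be in the range 0..255.
    """
    product_lo = 0   # location $60
    product_hi = 0   # location $61

    for _ in range(8):   # LDX #$08 ... DEX / BNE: always eight passes
        # ASL $60 / ROL $61: shift the 16-bit partial product left.
        carry = product_lo >> 7
        product_lo = (product_lo << 1) & 0xFF
        product_hi = ((product_hi << 1) | carry) & 0xFF

        # ASL $62: the top multiplier bit moves into the carry.
        carry = multiplier >> 7
        multiplier = (multiplier << 1) & 0xFF

        # LDA #$00 / SBC #$00: A = 0 - 0 - (1 - C), giving $00 if C=1, $FF if C=0.
        a = (0 - 0 - (1 - carry)) & 0xFF
        # EOR #$FF: complement, giving $FF if the multiplier bit was 1.
        a ^= 0xFF
        # AND $63 (result also stored in $64): 00 or the multiplicand.
        a &= multiplicand

        # CLC / ADC $60 / STA $60: add to the product low byte.
        total = a + product_lo
        product_lo = total & 0xFF
        carry = total >> 8
        # LDA $61 / ADC #$00 / STA $61: add the possible carry to the high byte.
        product_hi = (product_hi + carry) & 0xFF

    return (product_hi << 8) | product_lo


print(multiply_8x8(13, 11))
print(multiply_8x8(255, 255))
```

Because every PE executes exactly the same sequence whatever its operands 
are, the routine is suitable for independent data held in separate PEMs. 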


The Transfer of Results Routine does the following: 

1. transfers the 16-bit product from one PE to locations $1007 and $1008 
   in SM 

2. transfers the 16-bit product from a second PE to locations $1001 and 
   $1002 in SM 

3. transfers the 16-bit product from a third PE to locations $1003 and 
   $1004 in SM 

4. transfers the 16-bit product from the fourth PE to locations $1005 and 
   $1006 in SM. 

8 X 8-BIT MULTIPLICATION TRANSFER OF RESULTS ROUTINE 

LOCATION  OBJECT CODE  MNEMONIC  OPERAND    N  COMMENT 

3F00      A5 60        LDA       $60        3  ; GET PRODUCT LOW BYTE FROM PE 
3F02      8D 07 10     STA       $1007      4  ; TRANSFER TO SM 
3F05      A5 61        LDA       $61        3  ; GET PRODUCT HIGH BYTE FROM PE 
3F07      8D 08 10     STA       $1008      4  ; TRANSFER TO SM 
3F0A      8D 00 C1     STA       $C100      4  ; SET CP=1 
3F0D      A5 60        LDA       $60        3  ; GET PRODUCT LOW BYTE FROM PE 
3F0F      8D 01 10     STA       $1001      4  ; TRANSFER TO SM 
3F12      A5 61        LDA       $61        3  ; GET PRODUCT HIGH BYTE FROM PE 
3F14      8D 02 10     STA       $1002      4  ; TRANSFER TO SM 
3F17      8D 00 C3     STA       $C300      4  ; SET CP=2 
3F1A      A5 60        LDA       $60        3  ; GET PRODUCT LOW BYTE FROM PE 
3F1C      8D 03 10     STA       $1003      4  ; TRANSFER TO SM 
3F1F      A5 61        LDA       $61        3  ; GET PRODUCT HIGH BYTE FROM PE 
3F21      8D 04 10     STA       $1004      4  ; TRANSFER TO SM 
3F24      8D 00 C5     STA       $C500      4  ; SET CP=3 
3F27      A5 60        LDA       $60        3  ; GET PRODUCT LOW BYTE FROM PE 
3F29      8D 05 10     STA       $1005      4  ; TRANSFER TO SM 
3F2C      A5 61        LDA       $61        3  ; GET PRODUCT HIGH BYTE FROM PE 
3F2E      8D 06 10     STA       $1006      4  ; TRANSFER TO SM 
3F31      8D 00 C7     STA       $C700      4  ; SET CP=0 
3F34      60           RTS                  6  ; RETURN FROM SUBROUTINE 

                                           78  ; TOTAL MACHINE CYCLES REQUIRED 


Since the initialization routine disables all but the CP, it is 
necessary to know which PE is the CP before initialization. The 
initialization routine presented previously assumes that one particular PE 
is the CP prior to initialization and will not work if this is not the 
case. If one desires that the CP be a different PE, the software must be 
modified. The multiplication and transfer of results routines do not 
require that this PE be the CP. 


